e-Chapter 9 


Learning Aides 


Our quest has been for a low out-of-sample error. In the context of specific 
learning models, we have discussed techniques to fit the data in-sample, and 
ensure good generalization to out of sample. There are, however, additional 
issues that are likely to arise in any learning scenario, and a few simple en- 
hancements can often yield a drastic improvement. For example: Did you 
appropriately preprocess the data to take into account arbitrary choices that 
might have been made during data collection? Have you removed any irrel- 
evant dimensions in the data that are useless for approximating the target 
function, but can mislead the learning by adding stochastic noise? Are there 
properties that the target function is known to have, and if so, can these prop- 
erties help the learning? Have we chosen the best model among those that are 
available to us? We wrap up our discussion of learning techniques with a few 
general tools that can help address these questions. 


9.1 Input Preprocessing 


Exercise 9.1 


The Bank of Learning (BoL) gave Mr. Good and Mr. Bad credit cards 
based on their (Age, Income) input vector. 
                                           Mr. Good    Mr. Bad
(Age in years, Income in thousands of $)   (47, 35)    (22, 40)
Mr. Good paid off his credit card bill, but Mr. Bad defaulted. Mr. Unknown 
who has ‘coordinates’ (21yrs,$36K) applies for credit. Should the BoL 
give him credit, according to the nearest neighbor algorithm? If income is 
measured in dollars instead of in “K” (thousands of dollars), what is your 
answer? 


One could legitimately debate whether age (maturity) or income (financial 
resources) should be the determining factor in credit approval. It is, however, 


decidedly not recommended for something like a credit approval to hinge on 
an apparently arbitrary choice made during data collection, such as the de- 
nomination by which income was measured. On the contrary, many standard 
design choices when learning from data intend each dimension to be treated 
equally (such as using the Euclidean distance metric in similarity methods or 
using the sum of squared weights as the regularization term in weight decay). 
Unless there is some explicit reason not to do so, the data should be presented 
to the learning algorithm with each dimension on an equal footing. To do so, 
we need to transform the data to a standardized setting so that we can be 
immune to arbitrary choices made during data collection. 
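
The sensitivity to the unit of measurement can be checked numerically. The following is a minimal sketch that uses the two customers from Exercise 9.1 and a plain Euclidean nearest neighbor rule; the helper name `nearest_label` is hypothetical and chosen only for illustration.

```python
import numpy as np

def nearest_label(x_query, X_train, labels):
    """Return the label of the training point closest to x_query (Euclidean)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    return labels[int(np.argmin(distances))]

# (Age in years, Income in thousands of dollars), taken from Exercise 9.1.
X_thousands = np.array([[47.0, 35.0],   # Mr. Good
                        [22.0, 40.0]])  # Mr. Bad
labels = ["Good", "Bad"]
unknown_thousands = np.array([21.0, 36.0])

# The same people, with income measured in dollars instead.
to_dollars = np.array([1.0, 1000.0])
X_dollars = X_thousands * to_dollars
unknown_dollars = unknown_thousands * to_dollars

neighbor_thousands = nearest_label(unknown_thousands, X_thousands, labels)
neighbor_dollars = nearest_label(unknown_dollars, X_dollars, labels)
print("income in $K:", neighbor_thousands)
print("income in $: ", neighbor_dollars)
```

The nearest neighbor changes from one unit to the other even though nothing about the customers has changed, which is the kind of arbitrariness that standardization is meant to remove.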

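Recall that the data matrix $X \in \mathbb{R}^{N \times d}$ has, as its rows, the input data
vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$, where $\mathbf{x}_n \in \mathbb{R}^d$ (not augmented with a 1),

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}.$$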


The goal of input preprocessing is to transform the data $\mathbf{x}_n \mapsto \mathbf{z}_n$ to obtain the
transformed data matrix $Z$ which is standardized in some way. Let $\mathbf{z}_n = \Phi(\mathbf{x}_n)$
be the transformation. It is the data $(\mathbf{z}_n, y_n)$ that is fed into the learning
algorithm to produce a learned hypothesis $\tilde{g}(\mathbf{z})$.¹ The final hypothesis $g$ is

$$g(\mathbf{x}) = \tilde{g}(\Phi(\mathbf{x})).$$


Input Centering. Centering is a relatively benign transformation which 
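removes any bias in the inputs by translating the origin. Let $\bar{\mathbf{x}}$ be the
in-sample mean vector of the input data, $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n$; in matrix notation,
$\bar{\mathbf{x}} = \frac{1}{N} X^T \mathbf{1}$ ($\mathbf{1}$ is the column vector of $N$ 1’s). To obtain the transformed
vector, simply subtract the mean from each data point,

$$\mathbf{z}_n = \mathbf{x}_n - \bar{\mathbf{x}}.$$

By direct calculation, one can verify that $Z = X - \mathbf{1}\bar{\mathbf{x}}^T$. Hence,

$$\bar{\mathbf{z}} = \frac{1}{N} Z^T \mathbf{1} = \frac{1}{N} X^T \mathbf{1} - \frac{1}{N} \bar{\mathbf{x}} \mathbf{1}^T \mathbf{1} = \bar{\mathbf{x}} - \frac{1}{N} \bar{\mathbf{x}} \cdot N = \mathbf{0},$$

where we used $\mathbf{1}^T \mathbf{1} = N$ and the definition of $\bar{\mathbf{x}}$. Thus, the transformed
vectors are ‘centered’ in that they have zero mean, as desired. It is clear that
no information is lost by centering (as long as one has retained $\bar{\mathbf{x}}$, one can
always recover $\mathbf{x}$ from $\mathbf{z}$). If the data is not centered, we can always center it,
so from now on, for simplicity and without loss of generality, we will assume
that the input data is centered, and so $X^T \mathbf{1} = \mathbf{0}$.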


¹Note that higher-level features or nonlinear transforms can also be used to transform
the inputs before input processing. 
Exercise 9.2

Define the matrix $\gamma = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$. Show that $Z = \gamma X$. ($\gamma$ is called
the centering operator (see Appendix B.3) which projects onto the space
orthogonal to $\mathbf{1}$.)


Input Normalization. Centering alone will not solve the problem that 
arose in Exercise 9.1. The issue there is one of scale, not bias. Inflating 
the scale of the income variable exaggerates differences in income when using 
the standard Euclidean distance, thereby affecting your decision. One solution 
is to ensure that all the input variables have the same scale. One measure of 
scale (or spread) is the standard deviation. Since the data is now centered, 
the in-sample standard deviation $\sigma_i$ of input variable $i$ is defined by

$$\sigma_i^2 = \frac{1}{N} \sum_{n=1}^{N} x_{ni}^2,$$

where $x_{ni}$ is the $i$th component of the $n$th data point. Input normalization
transforms the inputs so that each input variable has unit standard deviation 
(scale). Specifically, the transformation is 


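$$\mathbf{z}_n = \begin{bmatrix} z_{n1} \\ \vdots \\ z_{nd} \end{bmatrix} = \begin{bmatrix} x_{n1}/\sigma_1 \\ \vdots \\ x_{nd}/\sigma_d \end{bmatrix} = D\mathbf{x}_n,$$

where $D$ is a diagonal matrix with entries $D_{ii} = 1/\sigma_i$. The scale of all input
variables is now 1, since $\tilde{\sigma}_i^2 = 1$ as the following derivation shows ($\tilde{\sigma}_i^2$ is
the variance, or scale, of dimension $i$ in the $Z$ space):

$$\tilde{\sigma}_i^2 = \frac{1}{N} \sum_{n=1}^{N} z_{ni}^2 = \frac{1}{N} \sum_{n=1}^{N} \frac{x_{ni}^2}{\sigma_i^2} = \frac{1}{\sigma_i^2} \cdot \frac{1}{N} \sum_{n=1}^{N} x_{ni}^2 = 1.$$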





Exercise 9.3 


Consider the data matrix X and the transformed data matrix Z. Show that 


$$Z = XD \quad \text{and} \quad Z^T Z = D X^T X D.$$


Input Whitening. Centering deals with bias in the inputs. Normalization
deals with scale. Our last concern is correlations. Strongly correlated input 
variables can have an unexpected impact on the outcome of learning. For 
example, with regularization, correlated input variables can render a friendly 
target function unlearnable. The next exercise shows that a simple function 
may require excessively large weights to implement if the inputs are correlated.
This means it is hard to regularize the learning, which in turn means you 
become susceptible to noise and overfitting. 


Exercise 9.4 


Let $\hat{x}_1$ and $\hat{x}_2$ be independent with zero mean and unit variance. You
measure inputs $x_1 = \hat{x}_1$ and $x_2 = \sqrt{1 - \epsilon^2}\,\hat{x}_1 + \epsilon \hat{x}_2$.

(a) What are variance($x_1$), variance($x_2$) and covariance($x_1, x_2$)?

(b) Suppose $f(\hat{\mathbf{x}}) = \hat{w}_1 \hat{x}_1 + \hat{w}_2 \hat{x}_2$ (linear in the independent variables).
Show that $f$ is linear in the correlated inputs, $f(\mathbf{x}) = w_1 x_1 + w_2 x_2$.
(Obtain $w_1, w_2$ as functions of $\hat{w}_1, \hat{w}_2$.)

(c) Consider the ‘simple’ target function $f(\hat{\mathbf{x}}) = \hat{x}_1 + \hat{x}_2$. If you perform
regression with the correlated inputs $\mathbf{x}$ and regularization constraint
$w_1^2 + w_2^2 \le C$, what is the maximum amount of regularization you can
use (minimum value of $C$) and still be able to implement the target?

(d) What happens to the minimum $C$ as the correlation increases ($\epsilon \to 0$)?

(e) Assuming that there is significant noise in the data, discuss your
results in the context of bias and var.


The previous exercise illustrates that if the inputs are correlated, then the 
weights cannot be independently penalized, as they are in the standard form 
of weight-decay regularization. If the measured input variables are correlated, 
then one should transform them to a set that are uncorrelated (at least
in-sample). That is the goal of input whitening.²
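
To make the effect of correlation concrete, the sketch below builds the correlated inputs of Exercise 9.4 from two independent standard normal variables and fits the noiseless target $\hat{x}_1 + \hat{x}_2$ by ordinary least squares. The sample size, the random seed and the use of `numpy.linalg.lstsq` are assumptions made for illustration; it is a numerical check, not a substitute for working the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
x_hat = rng.standard_normal((N, 2))      # independent inputs, zero mean, unit variance
y = x_hat[:, 0] + x_hat[:, 1]            # the 'simple' target, no noise

def squared_weight_norm(eps):
    """Fit y linearly in the correlated inputs and return w1^2 + w2^2."""
    x1 = x_hat[:, 0]
    x2 = np.sqrt(1.0 - eps**2) * x_hat[:, 0] + eps * x_hat[:, 1]
    X = np.column_stack([x1, x2])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(w @ w)

norms = {eps: squared_weight_norm(eps) for eps in (1.0, 0.1, 0.01)}
for eps, c in norms.items():
    print(f"eps = {eps:5.2f}   w1^2 + w2^2 = {c:.1f}")
```

As the inputs become more correlated, the weights needed to implement the same target grow, so a fixed weight-decay budget $C$ that was harmless for uncorrelated inputs can rule the target out.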

Remember that the data is centered, so the in-sample covariance matrix is 


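$$\Sigma = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^T = \frac{1}{N} X^T X. \tag{9.1}$$

$\Sigma_{ij} = \mathrm{cov}(x_i, x_j)$ is the in-sample covariance of inputs $i$ and $j$; $\Sigma_{ii} = \sigma_i^2$ is
the variance of input $i$. Assume that $\Sigma$ has full rank and let its matrix square
root be $\Sigma^{1/2}$, which satisfies $\Sigma^{1/2}\Sigma^{1/2} = \Sigma$ (see Problem 9.3 for the computation
of $\Sigma^{1/2}$). Consider the whitening transformation

$$\mathbf{z}_n = \Sigma^{-1/2} \mathbf{x}_n,$$

where $\Sigma^{-1/2}$ is the inverse of $\Sigma^{1/2}$. In matrix form, $Z = X\Sigma^{-1/2}$. $Z$ is whitened if
$\frac{1}{N} Z^T Z = I$. We verify that $Z$ is whitened as follows:

$$\frac{1}{N} Z^T Z = \Sigma^{-1/2} \left( \frac{1}{N} X^T X \right) \Sigma^{-1/2} = \Sigma^{-1/2} \Sigma \Sigma^{-1/2} = \left( \Sigma^{-1/2} \Sigma^{1/2} \right) \left( \Sigma^{1/2} \Sigma^{-1/2} \right) = I,$$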





²The term whitening is inherited from signal processing where white noise refers to a
signal whose frequency spectrum is uniform; this is indicative of a time series of independent 
noise realizations. The origin of the term white comes from white light which is a light signal 
whose amplitude distribution is uniform over all frequencies. 


where we used $\Sigma = \Sigma^{1/2}\Sigma^{1/2}$ and $\Sigma^{1/2}\Sigma^{-1/2} = \Sigma^{-1/2}\Sigma^{1/2} = I$. Thus, for the
transformed inputs, every dimension has scale 1 and the dimensions are pairwise
uncorrelated. Centering, normalizing and whitening are illustrated on a toy 
data set in Figure 9.1. 
Figure 9.1: Illustration of centering, normalization and whitening (first panel: raw data).


It is important to emphasize that input preprocessing does not throw away 
any information because the transformation is invertible: the original inputs 
can always be recovered from the transformed inputs using $\bar{\mathbf{x}}$ and $\Sigma$.
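
The three transformations, and the fact that they can be undone, can be sketched in a few lines of NumPy. The toy data below is made up for illustration, and computing the matrix square root through an eigendecomposition of the symmetric covariance matrix is one possible choice assumed here, not necessarily the method of Problem 9.3.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 3
# Toy raw data with an offset, unequal scales and correlations (made up).
A = np.array([[2.0, 0.0, 0.0],
              [1.5, 0.5, 0.0],
              [0.0, 0.3, 10.0]])
X_raw = rng.standard_normal((N, d)) @ A.T + np.array([5.0, -3.0, 100.0])

# Centering: subtract the in-sample mean from every row.
x_bar = X_raw.mean(axis=0)
X = X_raw - x_bar

# Normalization: divide each column by its in-sample standard deviation.
sigma = np.sqrt((X**2).mean(axis=0))
Z_norm = X / sigma

# Whitening: Z = X Sigma^{-1/2}, with Sigma = X^T X / N.
Sigma = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(Sigma)          # Sigma is symmetric
Sigma_half = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
Sigma_inv_half = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
Z_white = X @ Sigma_inv_half

# Invertibility: recover the raw inputs from the whitened ones.
X_recovered = Z_white @ Sigma_half + x_bar

print("whitened covariance is I:", np.allclose(Z_white.T @ Z_white / N, np.eye(d)))
print("raw data recovered:      ", np.allclose(X_recovered, X_raw))
```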














WARNING! Transforming the data to a more convenient
format has a hidden trap which easily leads to data snooping.

If you are using a test set to estimate your performance,
make sure to determine any input transformation only using
the training data. A simple rule: the test data should
be kept locked away in its raw form until you are ready to
test your final hypothesis. (See Example 5.3 for a concrete
illustration of how data snooping can affect your estimate
of the performance on your test set if input preprocessing
in any way used the test set.) After you determine
your transformation parameters from the training data,
you should use these same parameters to transform your
test data to evaluate your final hypothesis $g$.
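
One way to follow this rule in code is to separate the step that determines the transformation parameters from the step that applies them. The sketch below does this for centering and normalization; the function names `fit_standardizer` and `apply_standardizer`, the synthetic data and the 150/50 split are all hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute centering and scaling parameters from the training inputs only."""
    x_bar = X_train.mean(axis=0)
    sigma = np.sqrt(((X_train - x_bar) ** 2).mean(axis=0))
    return x_bar, sigma

def apply_standardizer(X, x_bar, sigma):
    """Transform any inputs using parameters that were fixed beforehand."""
    return (X - x_bar) / sigma

rng = np.random.default_rng(2)
# Synthetic (age, income in dollars) inputs, made up for illustration.
X_all = rng.normal(loc=[40.0, 50000.0], scale=[10.0, 15000.0], size=(200, 2))
X_train, X_test = X_all[:150], X_all[150:]

x_bar, sigma = fit_standardizer(X_train)           # the test set is not touched here
Z_train = apply_standardizer(X_train, x_bar, sigma)
Z_test = apply_standardizer(X_test, x_bar, sigma)  # same parameters, reused
```

The transformed test inputs will generally not have exactly zero mean or unit scale, and that is expected: the parameters describe the training data, not the test data.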
9.2 Dimension Reduction and Feature Selection


The curse of dimensionality is a general observation that statistical tasks 
get exponentially harder as the dimensions increase. In learning from data, 
this manifests itself in many ways, the most immediate being computational. 
Simple algorithms, such as optimal (or near-optimal) k-means clustering, or 
determining the optimal linear separator, have a computational complexity 
which scales exponentially with dimensionality. Furthermore, a fixed number, 
N, of data points only sparsely populates a space whose volume is growing 
exponentially with d. So, simple rules like nearest neighbor get adversely 
affected because a test point’s ‘nearest’ neighbor will likely be very far away 
and will not be a good representative point for predicting on the test point. 
The complexity of a hypothesis set, as could be measured by the VC dimension, 
will typically increase with d (recall that, for the simple linear perceptron, the 
VC dimension is d+ 1), affecting the generalization from in-sample to out- 
of-sample. The bottom line is that more data are needed to learn in higher- 
dimensional input spaces. 


Exercise 9.5 


Consider a data set with two examples,

$$(\mathbf{x}_1^T = [-1, a_1, \ldots, a_d],\ y_1 = +1); \qquad (\mathbf{x}_2^T = [1, b_1, \ldots, b_d],\ y_2 = -1),$$

where $a_i, b_i$ are independent random $\pm 1$ variables. Let $\mathbf{x}_{\text{test}}^T =
[-1, -1, \ldots, -1]$. Assume that only the first component of $\mathbf{x}$ is relevant
to $f$. However, the actual measured $\mathbf{x}$ has additional random components
in the additional $d$ dimensions. If the nearest neighbor rule is used, show,
either mathematically or with an experiment, that the probability of classifying
$\mathbf{x}_{\text{test}}$ correctly is $\frac{1}{2} + O\!\left(\frac{1}{\sqrt{d}}\right)$ ($d$ is the number of irrelevant dimensions).

What happens if there is a third data point $(\mathbf{x}_3^T = [1, c_1, \ldots, c_d],\ y_3 = -1)$?


The exercise illustrates that as you have more and more spurious (random) 
dimensions, the learned final hypothesis becomes useless because it is domi- 
nated by the random fluctuations in these spurious dimensions. Ideally, we 
should remove all such spurious dimensions before proceeding to the learning. 
Equivalently, we should retain only the few informative features. For the digits 
data from Chapter 3, we were able to obtain good performance by extracting 
just two features from the raw 16 × 16-pixel input image (intensity and symmetry),
a dimension reduction from 256 raw features to 2 informative ones. 
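
The effect of spurious dimensions in Exercise 9.5 can also be seen with a small experiment. This is a minimal sketch for the two-point data set only; the number of trials, the seed, and the convention that a tie in distance counts as an error are assumptions made here.

```python
import numpy as np

def nn_success_rate(d, trials=4000, seed=0):
    """Estimate P[nearest neighbor classifies x_test correctly] with d spurious dims."""
    rng = np.random.default_rng(seed)
    a = rng.choice([-1, 1], size=(trials, d))   # spurious coordinates of x_1 (y = +1)
    b = rng.choice([-1, 1], size=(trials, d))   # spurious coordinates of x_2 (y = -1)
    # x_test = [-1, -1, ..., -1]; the relevant first coordinate is -1 for x_1, +1 for x_2.
    dist1 = 0 + ((a + 1) ** 2).sum(axis=1)      # squared distance from x_test to x_1
    dist2 = 4 + ((b + 1) ** 2).sum(axis=1)      # squared distance from x_test to x_2
    # Correct when x_1, which carries the right label, is strictly closer.
    return float(np.mean(dist1 < dist2))

rates = {d: nn_success_rate(d) for d in (1, 9, 100, 400)}
for d, r in rates.items():
    print(f"d = {d:4d}   estimated success rate = {r:.3f}")
```

The estimated success rate drifts toward one half as $d$ grows, so the single relevant coordinate is drowned out by the random ones.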
The features z are simply a transformation of the input x, 


$$\mathbf{z} = \Phi(\mathbf{x}),$$


where the number of features is the dimension of z. If the dimension of z is 
less than the dimension of x, then we have accomplished dimension reduction. 
The ideal feature is the target function itself, z = f(x), since if we had this 
feature, we are done. This suggests that quality feature selection may be as
hard as the original learning problem of identifying f. 


We have seen features and feature transforms many times before, for example,
in the context of linear models and the non-linear feature transform
in Chapter 3. In that context, the non-linear feature transform typically
increased the dimension to handle the fact that the linear hypothesis was not
expressive enough to fit the data. In that setting, increasing the dimension
through the feature transform was attempting to improve $E_{\text{in}}$, and we did pay
the price of poorer generalization. The reverse is also true. If we can lower
the dimension without hurting $E_{\text{in}}$ (as would be the case if we retained all the
important information), then we will also improve generalization.


9.2.1 Principal Components Analysis (PCA) 


Centering, scaling and whitening all attempt to correct for arbitrary choices 
that may have been made during data collection. Feature selection, such 
as PCA, is conceptually different. It attempts to get rid of redundancy or 
less informative dimensions to help, among other things, generalization. For 
example, the top right pixel in the digits data is almost always white, so it 
is a dimension that carries almost no information. Removing that dimension 
will not hurt the fitting, but will improve generalization. 


PCA constructs a small number of linear features to summarize the input 
data. The idea is to rotate the axes (a linear transformation that defines a new 
coordinate system) so that the important dimensions in this new coordinate 
system become self evident and can be retained while the less important ones 
get discarded. Our toy data set can help crystallize the notion. 


[Figure: the toy data set, shown as Original Data and as Rotated Data.]





Once the data are rotated to the new ‘natural’ coordinate system, $z_1$ stands
out as the important dimension of the transformed input. The second dimension,
$z_2$, looks like a bunch of small fluctuations which we ought to ignore,
in comparison to the apparently more informative and larger $z_1$. Ignoring
the $z_2$ dimension amounts to setting it to zero (or just throwing away that
coordinate), producing a 1-dimensional feature.
Exercise 9.6


Try to build some intuition for what the rotation is doing by using the 
illustrations in Figure 9.1 to qualitatively answer these questions. 


(a) If there is a large offset (or bias) in both measured variables, how 
will this affect the ‘natural axes’, the ones to which the data will 
be rotated? Should you perform input centering before doing PCA? 





Hint: Consider [figure] versus [figure].


(b) If one dimension (say $x_1$) is inflated disproportionately (e.g., income
is measured in dollars instead of thousands of dollars), how will this
affect the ‘natural axes’, the ones to which the data should be rotated?
Should you perform input normalization before doing PCA?


(c) If you do input whitening, what will the ‘natural axes’ for the inputs 
be? Should you perform input whitening before doing PCA? 


What if the small fluctuations in the $z_2$ direction were the actual important
information on which $f$ depends, and the large variability in the $z_1$ dimension
are random fluctuations? Though possible, this rarely happens in practice,
and if it does happen, then your input is corrupted by large random noise and
you are in trouble anyway. So, let’s focus on the case where we have a chance
and discuss how to find this optimal rotation.

Intuitively, the direction $\mathbf{v}$ captures the largest fluctuations in the data,
which could be measured by variance. If we project the input $\mathbf{x}_n$ onto $\mathbf{v}$ to
get $z_n = \mathbf{v}^T \mathbf{x}_n$, then the variance of $z$ is $\frac{1}{N}\sum_{n=1}^{N} z_n^2$
(remember $\mathbf{x}_n$, and hence $z_n$, have zero mean).

$$\begin{aligned}
\mathrm{var}[z] &= \frac{1}{N} \sum_{n=1}^{N} z_n^2 = \frac{1}{N} \sum_{n=1}^{N} \mathbf{v}^T \mathbf{x}_n \mathbf{x}_n^T \mathbf{v} \\
&= \mathbf{v}^T \left( \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^T \right) \mathbf{v} \\
&= \mathbf{v}^T \Sigma \mathbf{v}.
\end{aligned}$$

To maximize $\mathrm{var}[z]$, we should pick $\mathbf{v}$ as the top eigenvector of $\Sigma$, the one with
the largest eigenvalue. Before we get more formal, let us address an apparent 
conflict of interest. Whitening is a way to put your data into a spherically 
symmetric form so that all directions are ‘equal’. This is recommended when 
you have no evidence to the contrary; you whiten the data because most 
learning algorithms treat every dimension equally (nearest neighbor, weight 
decay, etc.). PCA, on the other hand is highlighting specific directions which 
contain more variance. There is no use doing PCA after doing whitening, 
since every direction will be on an equal footing after whitening. You use 
PCA precisely because the directions are not to be treated equally. PCA
helps to identify and throw away the directions where the fluctuations are a 
result of small amounts of noise. After deciding which directions to throw 
away, you can now use whitening to put all the retained directions on an equal 
footing, if you wish. 
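
The claim that the top eigenvector of the covariance matrix is the direction of largest projected variance can be checked numerically. The sketch below uses made-up 2-dimensional data and restricts attention to unit-length directions, which is an assumption stated here since the variance would otherwise grow with the length of $\mathbf{v}$.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
# Correlated 2-d toy data (made up), then centered.
X = rng.standard_normal((N, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
X = X - X.mean(axis=0)
Sigma = X.T @ X / N

def projected_variance(v):
    """In-sample variance of z_n = v^T x_n for a unit vector v."""
    z = X @ v
    return float(np.mean(z**2))

# Top eigenvector of Sigma (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(Sigma)
v_top = eigvecs[:, -1]

# Compare against many random unit directions.
angles = rng.uniform(0.0, np.pi, size=200)
random_dirs = np.column_stack([np.cos(angles), np.sin(angles)])
best_random = max(projected_variance(v) for v in random_dirs)

print("variance along top eigenvector:", projected_variance(v_top))
print("best of 200 random directions: ", best_random)
```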

Our visual intuition works well in two dimensions, but in higher dimensions,
when no single direction captures most of the fluctuation, we need a more 
principled approach, starting with a mathematical formulation of the task. 
We begin with the observation that a rotation of the data exactly corresponds 
to representing the data in a new (rotated) coordinate system. 


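Coordinate Systems. A coordinate system is defined by an orthonormal
basis, a set of mutually orthogonal unit vectors. The standard Euclidean
coordinate system is defined by the Euclidean basis in $d$ dimensions, $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_d$,
where $\mathbf{u}_i$ is the $i$th standard basis vector which has a 1 in coordinate $i$ and 0
for all other coordinates. The input vector $\mathbf{x}$ has the components $x_i = \mathbf{x}^T \mathbf{u}_i$,
and we can write

$$\mathbf{x} = \sum_{i=1}^{d} x_i \mathbf{u}_i = \sum_{i=1}^{d} (\mathbf{x}^T \mathbf{u}_i)\, \mathbf{u}_i.$$

This can be done for any orthonormal basis $\mathbf{v}_1, \ldots, \mathbf{v}_d$,

$$\mathbf{x} = \sum_{i=1}^{d} z_i \mathbf{v}_i = \sum_{i=1}^{d} (\mathbf{x}^T \mathbf{v}_i)\, \mathbf{v}_i,$$

where the coordinates in the basis $\mathbf{v}_1, \ldots, \mathbf{v}_d$ are $z_i = (\mathbf{x}^T \mathbf{v}_i)$. The goal of PCA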
is to construct a more intuitive basis where some (hopefully the majority) of 
the coordinates are small and can be treated as small random fluctuations. 
These coordinates are going to be discarded, i.e., set to zero. The hope is that 
we have reduced the dimensionality of the problem while retaining most of the 
important information. 


So, given the vector $\mathbf{x}$ (in the standard coordinate system) and some other
coordinate system $\mathbf{v}_1, \ldots, \mathbf{v}_d$, we can define the transformed feature vector
whose components are the coordinates $z_1, \ldots, z_d$ in this new coordinate
system. Suppose that the first $k \le d$ of these transformed coordinates are the
informative ones, so we throw away the remaining coordinates to arrive at our
dimension-reduced feature vector

$$\mathbf{z} = \begin{bmatrix} z_1 \\ \vdots \\ z_k \end{bmatrix} = \begin{bmatrix} \mathbf{x}^T \mathbf{v}_1 \\ \vdots \\ \mathbf{x}^T \mathbf{v}_k \end{bmatrix}.$$
Exercise 9.7

(a) Show that $\mathbf{z}$ is a linear transformation of $\mathbf{x}$, $\mathbf{z} = V^T \mathbf{x}$. What are the
dimensions of the matrix $V$ and what are its columns?

(b) Show that the transformed data matrix is $Z = XV$.

(c) Show that $\sum_{i=1}^{d} z_i^2 = \sum_{i=1}^{d} x_i^2$ and hence that $\|\mathbf{z}\| \le \|\mathbf{x}\|$.


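If we kept all the components $z_1, \ldots, z_d$, then we can reconstruct $\mathbf{x}$ via

$$\mathbf{x} = \sum_{i=1}^{d} z_i \mathbf{v}_i.$$

Using only the first $k$ components, the best reconstruction of $\mathbf{x}$ is

$$\hat{\mathbf{x}} = \sum_{i=1}^{k} z_i \mathbf{v}_i.$$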


We have lost that part of x represented by the trailing coordinates of z. The 
magnitude of the part we lost is captured by the reconstruction error 














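$$\|\mathbf{x} - \hat{\mathbf{x}}\|^2 = \Big\| \sum_{i=k+1}^{d} z_i \mathbf{v}_i \Big\|^2 = \sum_{i=k+1}^{d} z_i^2$$

(because $\mathbf{v}_1, \ldots, \mathbf{v}_d$ are orthonormal). The new coordinate system is good if
the sum of the reconstruction errors over the data points is small. That is, if

$$\sum_{n=1}^{N} \|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2$$

is small. If the $\mathbf{x}_n$ are reconstructed with small error from $\mathbf{z}_n$ (i.e. $\hat{\mathbf{x}}_n \approx \mathbf{x}_n$),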
then not much information was lost. PCA finds a coordinate system that 
minimizes this total reconstruction error. The trailing dimensions will have 
the least possible information, and so even after throwing away those trailing 
dimensions, we can still almost reconstruct the original data. PCA is optimal, 
which means that no other linear method can produce coordinates with a 
smaller reconstruction error. The first $k$ basis vectors, $\mathbf{v}_1, \ldots, \mathbf{v}_k$, of this
optimal coordinate basis are called the top-k principal directions. 

So, how do we find this optimal coordinate basis $\mathbf{v}_1, \ldots, \mathbf{v}_d$ (of which we
only need $\mathbf{v}_1, \ldots, \mathbf{v}_k$ to compute our dimensionally reduced feature)? The
solution to this problem has been known since 1936, when Eckart and Young
showed that it is given by the remarkable singular value decomposition (SVD).
The SVD is such a useful tool for learning from data that time spent mastering
it will pay dividends (additional background on the SVD is given in Appendix B.2).
The Singular Value Decomposition (SVD). Any matrix, for example,
our data matrix $X$, has a very special representation as the product of three
matrices. Assume that $X \in \mathbb{R}^{N \times d}$ with $N \ge d$ (a similar decomposition holds
for $X^T$ if $N < d$). Then,

$$X = U \Gamma V^T$$

where $U \in \mathbb{R}^{N \times d}$ has orthonormal columns, $V \in \mathbb{R}^{d \times d}$ is an orthogonal matrix³
and $\Gamma$ is a non-negative diagonal matrix.⁴ The diagonal elements $\gamma_i = \Gamma_{ii}$,
where $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_d \ge 0$, are the singular values of $X$, ordered from
largest to smallest. The number of non-zero singular values is the rank of $X$,
which we will assume is $d$ for simplicity. Pictorially,

$$\underbrace{X}_{(N \times d)} = \underbrace{U}_{(N \times d)} \times \underbrace{\Gamma}_{(d \times d)} \times \underbrace{V^T}_{(d \times d)}$$


















































The matrix $U$ contains (as its columns) the left singular vectors of $X$, and
similarly $V$ contains (as its columns) the right singular vectors of $X$. Since $U$
consists of orthonormal columns, $U^T U = I_d$. Similarly, $V^T V = V V^T = I_d$. If $X$
is square, then $U$ will be square. Just as a square matrix maps an eigenvector
to a multiple of itself, a more general non-square matrix maps a left (resp.
right) singular vector to a multiple of the corresponding right (resp. left)
singular vector, as verified by the following identities:

$$U^T X = \Gamma V^T; \qquad XV = U\Gamma.$$

It is convenient to have column representations of $U$ and $V$ in terms of the
singular vectors, $U = [\mathbf{u}_1, \ldots, \mathbf{u}_d]$ and $V = [\mathbf{v}_1, \ldots, \mathbf{v}_d]$.
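
These properties are easy to check on a small random matrix. The sketch below uses `numpy.linalg.svd` with `full_matrices=False`, which returns the $N \times d$ matrix $U$, the singular values as a vector, and $V^T$ rather than $V$; the matrix size and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 8, 3
X = rng.standard_normal((N, d))

# Thin SVD: U is N x d, gamma holds the d singular values, Vt is V^T (d x d).
U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
Gamma = np.diag(gamma)
V = Vt.T

checks = {
    "X = U Gamma V^T":       np.allclose(X, U @ Gamma @ Vt),
    "U^T U = I_d":           np.allclose(U.T @ U, np.eye(d)),
    "V^T V = V V^T = I_d":   np.allclose(V.T @ V, np.eye(d)) and np.allclose(V @ V.T, np.eye(d)),
    "U^T X = Gamma V^T":     np.allclose(U.T @ X, Gamma @ Vt),
    "X V = U Gamma":         np.allclose(X @ V, U @ Gamma),
    "singular values sorted": bool(np.all(np.diff(gamma) <= 0)),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```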


Exercise 9.8 
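Show $U^T X = \Gamma V^T$ and $XV = U\Gamma$, and hence $X^T \mathbf{u}_i = \gamma_i \mathbf{v}_i$ and $X \mathbf{v}_i = \gamma_i \mathbf{u}_i$.

(The $i$th singular vectors and singular value $(\mathbf{u}_i, \mathbf{v}_i, \gamma_i)$ play a similar role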
to eigenvector-eigenvalue pairs.) 


Computing the Principal Components via SVD. It is no coincidence
that we used $\mathbf{v}_i$ for the right singular vectors of $X$ and $\mathbf{v}_i$ in the mathematical
formulation of the PCA task. The right singular vectors $\mathbf{v}_1, \ldots, \mathbf{v}_d$ are our
optimal coordinate basis so that by ignoring the trailing components we incur
the least reconstruction error.

³An orthogonal matrix is a square matrix with orthonormal columns, and therefore its
inverse is the same as its transpose. So, $V^T V = V V^T = I$.

⁴In the traditional linear algebra literature, $\Gamma$ is typically denoted by $\Sigma$ and its diagonal
elements are the singular values $\sigma_1 \ge \cdots \ge \sigma_d$. We use $\Gamma$ because we have reserved $\Sigma$ for
the covariance matrix of the input distribution and $\sigma^2$ for the noise variance.
Theorem 9.1 (Eckart and Young, 1936). For any $k$, $\mathbf{v}_1, \ldots, \mathbf{v}_k$ (the top-$k$
right singular vectors of the data matrix $X$) are a set of top-$k$ principal
component directions and the optimal reconstruction error is $\sum_{i=k+1}^{d} \gamma_i^2$.

The components of the dimensionally reduced feature vector are $z_i = \mathbf{x}^T \mathbf{v}_i$
and the reconstructed vector is $\hat{\mathbf{x}} = \sum_{i=1}^{k} z_i \mathbf{v}_i$, which in matrix form is

$$\hat{X} = X V_k V_k^T, \tag{9.2}$$

where $V_k = [\mathbf{v}_1, \ldots, \mathbf{v}_k]$ is the matrix of top-$k$ right singular vectors of $X$.


PCA Algorithm:
Inputs: The centered data matrix $X$ and $k \ge 1$.

1: Compute the SVD of $X$: $[U, \Gamma, V] = \mathrm{svd}(X)$.
2: Let $V_k = [\mathbf{v}_1, \ldots, \mathbf{v}_k]$ be the first $k$ columns of $V$.
3: The PCA-feature matrix and the reconstructed data are

$$Z = X V_k, \qquad \hat{X} = X V_k V_k^T.$$
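
The three steps of the algorithm translate almost line for line into NumPy. This is a minimal sketch rather than a tuned implementation: the function name `pca` and the synthetic data are hypothetical, and `numpy.linalg.svd` returns $V^T$, so the top-$k$ right singular vectors are its first $k$ rows.

```python
import numpy as np

def pca(X, k):
    """Return (Z, X_hat, V_k) for a centered N x d data matrix X and 1 <= k <= d."""
    U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T            # first k columns of V (top-k right singular vectors)
    Z = X @ V_k               # N x k PCA-feature matrix
    X_hat = Z @ V_k.T         # N x d reconstruction, X V_k V_k^T
    return Z, X_hat, V_k

rng = np.random.default_rng(5)
X_raw = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
X = X_raw - X_raw.mean(axis=0)     # the algorithm expects centered data
Z, X_hat, V_k = pca(X, k=2)

# Reconstruction error versus the sum of the trailing squared singular values.
gamma = np.linalg.svd(X, compute_uv=False)
print(np.sum((X - X_hat) ** 2), np.sum(gamma[2:] ** 2))
```

The two printed numbers should agree up to rounding, which is the statement of Theorem 9.1 for this data set and $k = 2$.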


Note that PCA and the other input preprocessing tools (centering,
rescaling, whitening) are all unsupervised: you do not need the $y$-values. The
Eckart-Young theorem is quite remarkable and so fundamental in data analysis
that it certainly warrants a proof.


Begin safe skip: You may skip the proof without
compromising the logical sequence. A similar green
box will tell you when to rejoin.


We will need some matrix algebra preliminaries which are useful general tools. 
Recall that the reconstructed data matrix $\hat{X}$ is similar to the data matrix, having
the reconstructed input vectors $\hat{\mathbf{x}}_n^T$ as its rows. The Frobenius norm $\|A\|_F$
of a matrix $A \in \mathbb{R}^{N \times d}$ is the analog of the Euclidean norm, but for matrices:

$$\|A\|_F^2 \overset{\text{def}}{=} \sum_{n=1}^{N} \sum_{i=1}^{d} A_{ni}^2 = \sum_{n=1}^{N} \|\mathrm{row}_n(A)\|^2 = \sum_{i=1}^{d} \|\mathrm{column}_i(A)\|^2.$$

The reconstruction error is exactly the squared Frobenius norm of the matrix difference
between the original and reconstructed data matrices, $\|X - \hat{X}\|_F^2$.


Exercise 9.9 


Consider an arbitrary matrix $A$, and any matrices $U$, $V$ with orthonormal
columns ($U^T U = I$ and $V^T V = I$).

(a) Show that $\|A\|_F^2 = \mathrm{trace}(A A^T) = \mathrm{trace}(A^T A)$.

(b) Show that $\|U A V^T\|_F^2 = \|A\|_F^2$ (assume all matrix products exist).
[Hint: Use part (a).]
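Proof of the Eckart-Young Theorem. The preceding discussion was for general
$A$. We now set $A$ to be any orthonormal basis $A$, with columns $\mathbf{a}_1, \ldots, \mathbf{a}_d$,

$$A = [\mathbf{a}_1, \ldots, \mathbf{a}_d].$$

Since $V$ is an orthonormal basis, we can write

$$A = V\Psi,$$

where $\Psi = [\psi_1, \ldots, \psi_d]$. Since

$$I = A^T A = \Psi^T V^T V \Psi = \Psi^T \Psi,$$

we see that $\Psi$ is orthogonal. Suppose (without loss of generality) that we will
use the first $k$ basis vectors of $A$ for reconstruction. So, define

$$A_k = [\mathbf{a}_1, \ldots, \mathbf{a}_k] = V\Psi_k,$$

where $\Psi_k = [\psi_1, \ldots, \psi_k]$. We use $A_k$ to approximate $X$ using the
reconstruction $\hat{X} = X A_k A_k^T$ from (9.2). Then,

$$\begin{aligned}
\|X - \hat{X}\|_F^2 &= \|X - X A_k A_k^T\|_F^2 \\
&= \|U \Gamma V^T - U \Gamma V^T V \Psi_k \Psi_k^T V^T\|_F^2 \\
&= \|U \Gamma (I - \Psi_k \Psi_k^T) V^T\|_F^2 \\
&= \|\Gamma (I - \Psi_k \Psi_k^T)\|_F^2,
\end{aligned}$$

where we have used $V^T V = I_d$ and Exercise 9.9(b). By Exercise 9.9(a),

$$\|\Gamma (I - \Psi_k \Psi_k^T)\|_F^2 = \mathrm{trace}\left( \Gamma (I - \Psi_k \Psi_k^T)^2 \Gamma \right).$$

Now, using the linearity and cyclic properties of the trace and the fact that
$I - \Psi_k \Psi_k^T$ is a projection,

$$\begin{aligned}
\mathrm{trace}\left( \Gamma (I - \Psi_k \Psi_k^T)^2 \Gamma \right) &= \mathrm{trace}\left( \Gamma (I - \Psi_k \Psi_k^T) \Gamma \right) \\
&= \mathrm{trace}(\Gamma^2) - \mathrm{trace}\left( \Gamma \Psi_k \Psi_k^T \Gamma \right) \\
&= \mathrm{trace}(\Gamma^2) - \mathrm{trace}\left( \Psi_k^T \Gamma^2 \Psi_k \right).
\end{aligned}$$

The first term is independent of $\Psi_k$, so we must maximize the second term.
Since $\Psi_k$ has $k$ orthonormal columns, $\|\Psi_k\|_F^2 = k$ (because the Frobenius
norm is the sum of squared column norms). Let the rows of $\Psi_k$ be $\mathbf{q}_1^T, \ldots, \mathbf{q}_d^T$.
Then $0 \le \|\mathbf{q}_i\|^2 \le 1$ ($\Psi$ has orthonormal rows and $\mathbf{q}_i$ are truncated rows of $\Psi$),
and $\sum_{i=1}^{d} \|\mathbf{q}_i\|^2 = k$ (because the Frobenius norm is the sum of squared
row-norms). We also have that

$$\Psi_k^T \Gamma^2 \Psi_k = \begin{bmatrix} \mathbf{q}_1 & \cdots & \mathbf{q}_d \end{bmatrix} \begin{bmatrix} \gamma_1^2 & & \\ & \ddots & \\ & & \gamma_d^2 \end{bmatrix} \begin{bmatrix} \mathbf{q}_1^T \\ \vdots \\ \mathbf{q}_d^T \end{bmatrix} = \sum_{i=1}^{d} \gamma_i^2\, \mathbf{q}_i \mathbf{q}_i^T.$$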
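Taking the trace and using linearity of the trace gives

$$\mathrm{trace}(\Psi_k^T \Gamma^2 \Psi_k) = \sum_{i=1}^{d} \gamma_i^2 \|\mathbf{q}_i\|^2.$$

To maximize $\mathrm{trace}(\Psi_k^T \Gamma^2 \Psi_k)$, the best possible way to spend our budget of $k$
for $\sum_i \|\mathbf{q}_i\|^2$ is to put as much as possible into $\|\mathbf{q}_1\|$ and then $\|\mathbf{q}_2\|$ and so on,
because $\gamma_1 \ge \gamma_2 \ge \cdots$. This will result in $\|\mathbf{q}_1\|^2 = \cdots = \|\mathbf{q}_k\|^2 = 1$ and
all the remaining $\mathbf{q}_i = \mathbf{0}$. Thus,

$$\mathrm{trace}(\Psi_k^T \Gamma^2 \Psi_k) \le \sum_{i=1}^{k} \gamma_i^2.$$

We conclude that

$$\begin{aligned}
\|X - \hat{X}\|_F^2 &= \mathrm{trace}(\Gamma^2) - \mathrm{trace}\left( \Psi_k^T \Gamma^2 \Psi_k \right) \\
&= \sum_{i=1}^{d} \gamma_i^2 - \mathrm{trace}\left( \Psi_k^T \Gamma^2 \Psi_k \right) \\
&\ge \sum_{i=1}^{d} \gamma_i^2 - \sum_{i=1}^{k} \gamma_i^2 \\
&= \sum_{i=k+1}^{d} \gamma_i^2.
\end{aligned}$$

If $A = V$ so that $\Psi = I$, then indeed $\|\mathbf{q}_1\|^2 = \cdots = \|\mathbf{q}_k\|^2 = 1$ and we attain
equality in the bound above, showing that the top-$k$ right singular vectors of $X$
do indeed give an optimal basis, which concludes the proof. ∎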


The proof of the theorem also shows that the optimal reconstruction error is
the sum of squares of the trailing singular values of $X$, $\|X - \hat{X}\|_F^2 = \sum_{i=k+1}^{d} \gamma_i^2$.


End safe skip: Those who skipped the proof are now 

rejoining us for some examples of PCA in action. 
Example 9.2. It is instructive to see PCA in action. The digits training data
matrix has 7291 (16 × 16)-images ($d = 256$), so $X \in \mathbb{R}^{7291 \times 256}$. Let $\bar{\mathbf{x}}$ be the
average image. First center the data, setting $\mathbf{x}_n \leftarrow \mathbf{x}_n - \bar{\mathbf{x}}$, to get a centered
data matrix $X$.

Now, let’s compute the SVD of this centered data matrix, $X = U \Gamma V^T$.
To do so we may use a built in SVD package that is pretty much standard
on any numerical platform. It is also possible to only compute the top-$k$
singular vectors in most SVD packages to obtain $U_k, \Gamma_k, V_k$, where $U_k$ and $V_k$
(a) Reconstruction error (vertical axis: % reconstruction error, horizontal axis: k)    (b) Top-2 PCA-features


Figure 9.2: PCA on the digits data. (a) Shows how the reconstruction 
error depends on the number of components k; about 150 features suffice to 
reconstruct the data almost perfectly. If all principal components were equally
important, the reconstruction error would decrease linearly with k, which is
not the case here. (b) Shows the two features obtained using the top two 
principal components. These features look as good as our hand-constructed 
features of symmetry and intensity from Example 3.5 in Chapter 3. 


contain only the top-$k$ singular vectors, and $\Gamma_k$ the top-$k$ singular values. The
reconstructed matrix is

$$\hat{X} = X V_k V_k^T = U_k \Gamma_k V_k^T.$$

By the Eckart-Young theorem, $\hat{X}$ is the best rank-$k$ approximation to $X$. Let’s
look at how the reconstruction error depends on $k$, the number of features used.
To do this, we plot the reconstruction error using $k$ principal components as
a percentage of the reconstruction error with zero components. With zero
components, the reconstruction error is just $\|X\|_F^2$. The result is shown in
Figure 9.2(a). As can be seen, with just 50 components, the reconstruction 
error is about 10%. This is a rule of thumb to determine how many components 
to use: choose k to obtain a reconstruction error of less than 10%. This 
amounts to treating the bottom 10% of the fluctuations in the data as noise. 
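
Because the reconstruction error with $k$ components is the sum of the trailing squared singular values, the whole error curve can be computed from the singular values alone. The sketch below applies the 10% rule of thumb to synthetic data that stands in for a real data set; the function names and the data are hypothetical.

```python
import numpy as np

def reconstruction_error_percent(X):
    """Percent reconstruction error for k = 0, 1, ..., d components of centered X."""
    gamma = np.linalg.svd(X, compute_uv=False)        # singular values, largest first
    energy = gamma**2
    # Error with k components is the sum of the trailing squared singular values.
    trailing = energy.sum() - np.concatenate([[0.0], np.cumsum(energy)])
    return 100.0 * trailing / energy.sum()

def smallest_k_below(X, threshold_percent=10.0):
    """Smallest k whose reconstruction error is below the given percentage."""
    err = reconstruction_error_percent(X)
    return int(np.argmax(err < threshold_percent))

rng = np.random.default_rng(6)
# Synthetic stand-in: 3 strong directions plus weak noise in 20 dimensions.
signal = rng.standard_normal((300, 3)) @ rng.standard_normal((3, 20))
X = signal + 0.05 * rng.standard_normal((300, 20))
X = X - X.mean(axis=0)

err = reconstruction_error_percent(X)
k = smallest_k_below(X)
print("chosen k:", k)
print("error at k (percent):", err[k])
```

With zero components the error is 100% by construction, and the curve can only decrease as $k$ grows, so the first index that falls below the threshold is well defined.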

Let’s reduce the dimensionality to 2 by projecting onto the top two principal
components in $V_2$. The resulting features are shown in Figure 9.2(b).
These features can be compared to the features obtained using intensity and
symmetry in Example 3.5 on page 106. The features appear to be quite good,
and suitable for solving this learning problem. Unlike the intensity and symmetry
features which we used in Example 3.5, the biggest advantage of these PCA
features is that their construction is fully automated: you don’t need to know
anything about the digit recognition problem to obtain them. This is also their
biggest disadvantage: the features are provably good at reconstructing the
input data, but there is no guarantee that they will be useful for solving the
learning problem. In practice, a little thought about constructing features,
together with such automated methods to then reduce the dimensionality,
usually works best.

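Finally, suppose you use the feature vector $\mathbf{z}$ in Figure 9.2(b) to build a
classifier $\tilde{g}(\mathbf{z})$ to distinguish between the digit 1 and all the other digits. The
final hypothesis to be applied to a test input $\mathbf{x}_{\text{test}}$ is

$$g(\mathbf{x}_{\text{test}}) = \tilde{g}\left( V_2^T (\mathbf{x}_{\text{test}} - \bar{\mathbf{x}}) \right),$$

where $\tilde{g}$, $V_2$ and $\bar{\mathbf{x}}$ were constructed from the data: $V_2$ and $\bar{\mathbf{x}}$ from the data
inputs, and $\tilde{g}$ from the targets and PCA-reduced inputs $(Z, \mathbf{y})$.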
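
The composition of the learned pieces can be sketched as follows. The training data here is synthetic, and the classifier on the PCA features is a linear model fit by least squares; both are assumptions chosen only to keep the example short, not the classifier used for the digits data.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic stand-in for a training set: inputs X_train (N x d), targets y_train in {-1, +1}.
N, d = 400, 10
scales = np.array([5.0, 3.0] + [0.5] * (d - 2))
X_train = rng.standard_normal((N, d)) * scales + 2.0
y_train = np.sign(X_train[:, 0] - 2.0)

# Learned from the training inputs only: the mean and the top-2 principal directions.
x_bar = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - x_bar, full_matrices=False)
V2 = Vt[:2].T

# Learned from (Z, y): a linear classifier on the 2 PCA features (least squares fit).
Z = (X_train - x_bar) @ V2
Z_aug = np.column_stack([np.ones(N), Z])
w = np.linalg.lstsq(Z_aug, y_train, rcond=None)[0]

def g(x_test):
    """Final hypothesis g(x) = g_tilde(V2^T (x - x_bar)) for a raw input x."""
    z = V2.T @ (x_test - x_bar)
    return np.sign(w[0] + w[1:] @ z)

train_accuracy = np.mean(np.array([g(x) for x in X_train]) == y_train)
print("training accuracy:", train_accuracy)
```

Note that `g` takes a raw input: the centering and projection learned from the training data are part of the final hypothesis, so a test point is never preprocessed with statistics of its own.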














We end this section by justifying the “maximize the variance” intuition 
that began our discussion of PCA. The next exercise shows that the principal 
components do indeed represent the high variance directions in the data. 


Exercise 9.10 


Assume that the data matrix $X$ is centered, and define the covariance matrix 
$\Sigma = \frac{1}{N} X^{\mathrm{T}} X$. Assume all the singular values of $X$ are distinct. (What does 
this mean for the eigenvalues of $\Sigma$?) For a potential principal direction $\mathbf{v}$, 
we defined $z_n = \mathbf{x}_n^{\mathrm{T}} \mathbf{v}$ and showed that $\mathrm{var}(z_1, \ldots, z_N) = \mathbf{v}^{\mathrm{T}} \Sigma \mathbf{v}$. 
(a) Show that the direction which results in the highest variance is $\mathbf{v}_1$, 
the top right singular vector of $X$. 
(b) Show that we obtain the top-k principal directions, $\mathbf{v}_1, \ldots, \mathbf{v}_k$, by 
selecting k directions sequentially, each time obtaining the direction 
with highest variance that is orthogonal to all previously selected 
directions. 
This shows that the top-k principal directions are the directions of 
highest variance. 


(c) If you don’t know the data matrix $X$, but only know the covariance matrix $\Sigma$, 
can you obtain the principal directions? If so, how? 


9.2.2 Nonlinear Dimension Reduction 


Figure 9.3 illustrates one thing that can go wrong with PCA. You can see that 
the data in Figure 9.3(a) approximately lie on a one-dimensional surface (a 
curve). However, when we try to reconstruct the data using top-1 PCA, the 
result is a disaster, shown in Figure 9.3(b). This is because, even though the 
data lie on a curve, that curve is not linear. PCA can only construct linear 
features. If the data do not live on a lower dimensional ‘linear manifold’, then 
PCA will not work. If you are going to use PCA, here is a checklist that will 
help you determine whether PCA will work. 


1. Do you expect the data to have linear structure, for example does the data 
lie in a linear subspace of lower dimension? 

(a) Data in $\mathcal{X}$ space (b) Top-1 PCA reconstruction 


Figure 9.3: (a) The data in $\mathcal{X}$ space does not ‘live’ in a lower dimensional 
linear manifold. (b) The data reconstructed using top-1 PCA must lie 
on a line and therefore cannot accurately represent the original data. 


2. Do the bottom principal components contain primarily small random fluc- 
tuations that correspond to noise and should be thrown away? The fact 
that they are small can be determined by looking at the reconstruction 
error. The fact that they are noise is not much more than a guess. 


3. Does the target function f depend primarily on the top principal compo- 
nents, or are the small fluctuations in the bottom principal components key 
in determining the value of f? If the latter, then PCA will not help the 
machine learning task. In practice, it is difficult to determine whether this 
is true (without snooping ☺). A validation method can help determine 
whether to use PCA-dimension-reduction or not. Usually, throwing away 
the lowest principal components does not throw away significant informa- 
tion related to the target function, and what little it does throw away is 
made up for in the reduced generalization error bar because of the lower 
dimension. 


Clearly, PCA will not work for our data in Figure 9.3. However, we are not 
dead yet. We have an ace up our sleeve, namely the all-powerful nonlinear 
transform. Looking at the data in Figure 9.3 suggests that the angular 
coordinate is important. So, let’s consider a transform to the nonlinear feature 
space defined by polar coordinates: 

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \xrightarrow{\ \Phi\ } \begin{bmatrix} r \\ \theta \end{bmatrix} = \begin{bmatrix} \sqrt{x_1^2 + x_2^2} \\ \tan^{-1}(x_2 / x_1) \end{bmatrix}.$$

The data using polar coordinates is shown in Figure 9.4(a). In this space, the 
data clearly lie on a linear subspace, appropriate for PCA. The top-1 PCA 
reconstructed data (in the nonlinear feature space) is shown in Figure 9.4(b). 


(a) Transformed data in $\mathcal{Z}$ space (b) Reconstruction in $\mathcal{Z}$ space 
[axis label in both panels: $r = \sqrt{x_1^2 + x_2^2}$] 


Figure 9.4: PCA in a nonlinear feature space. (a) The transformed data 
in the $\mathcal{Z}$ space are approximately on a linear manifold. (b) Shows nonlinear 
reconstruction of the data in the nonlinear feature space. 


We can obtain the reconstructed data in the original $\mathcal{X}$ space by transforming 
the red reconstructed points in Figure 9.4(b) back to $\mathcal{X}$ space, as shown below. 
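
The round trip (transform, top-1 PCA, invert) can be sketched as follows. This is an illustration under stated assumptions, not the code behind Figures 9.3 and 9.4: the spiral data are synthetic, the helper names are hypothetical, and `np.arctan2` is used as a quadrant-aware stand-in for $\tan^{-1}(x_2/x_1)$.

```python
import numpy as np

def to_polar(X):
    """Phi: (x1, x2) -> (r, theta)."""
    return np.column_stack([np.hypot(X[:, 0], X[:, 1]), np.arctan2(X[:, 1], X[:, 0])])

def from_polar(Z):
    """Inverse of Phi: (r, theta) -> (x1, x2)."""
    return np.column_stack([Z[:, 0] * np.cos(Z[:, 1]), Z[:, 0] * np.sin(Z[:, 1])])

def top1_reconstruct(Z):
    """Project the centered rows of Z onto the top principal direction, then un-center."""
    z_bar = Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z - z_bar, full_matrices=False)
    v1 = Vt[0]                                   # top right singular vector
    return z_bar + np.outer((Z - z_bar) @ v1, v1)

# Hypothetical noisy spiral r = theta, with theta in (0.5, 3).
rng = np.random.default_rng(1)
theta = rng.uniform(0.5, 3.0, size=300)
r = theta + 0.02 * rng.normal(size=300)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

X_hat_linear = top1_reconstruct(X)                        # PCA directly in X space
X_hat_polar = from_polar(top1_reconstruct(to_polar(X)))   # PCA in Z space, then invert

err_linear = np.mean(np.sum((X - X_hat_linear) ** 2, axis=1))
err_polar = np.mean(np.sum((X - X_hat_polar) ** 2, axis=1))
print(err_linear, err_polar)
```

On data of this kind the reconstruction through the polar feature space should be far more accurate than top-1 PCA applied directly to the inputs, because the points lie close to a line only after the transform.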





Exercise 9.11 


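Using the feature transform $\Phi : \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \begin{bmatrix} x_1 \\ x_2 \\ x_1 + x_2 \end{bmatrix}$, you have run top-1 PCA 
on your data $\mathbf{z}_1, \ldots, \mathbf{z}_N$ in $\mathcal{Z}$ space to obtain $V_1 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$ and $\bar{\mathbf{z}} = \mathbf{0}$. 


For the test point $\mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, compute $\mathbf{z}$, $\hat{\mathbf{z}}$, $\hat{\mathbf{x}}$. 


($\mathbf{z}$ is the test point in $\mathcal{Z}$ space; $\hat{\mathbf{z}}$ is the reconstructed test point in $\mathcal{Z}$ space 
using top-1 PCA; $\hat{\mathbf{x}}$ is the reconstructed test point in $\mathcal{X}$ space.) 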


Exercise 9.11 illustrates that you may not always be able to obtain the 
reconstructed data in your original $\mathcal{X}$ space. For our spiral example, we can obtain 
the reconstructed data in the $\mathcal{X}$ space because the polar feature transform 
is invertible; invertibility of $\Phi$ is essential to reconstruction in $\mathcal{X}$ space. In 
general, you may want to use a feature transform that is not invertible, for 
example the 2nd-order polynomial transform. In this case, you will not be 
able to reconstruct your data in the $\mathcal{X}$ space, but that need not prevent you 
from using PCA with the nonlinear feature transform, if your goal is 
prediction. You can transform to your $\mathcal{Z}$ space, run PCA in the $\mathcal{Z}$ space, discard 
the bottom principal components (dimension reduction), and run your 
learning algorithm using the top principal components. Finally, to classify a new 
test point, you first transform it, and then classify in the $\mathcal{Z}$ space. The entire 
workflow is summarized in the following algorithm. 


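PCA with Nonlinear Feature Transform: 

Inputs: The data $X, \mathbf{y}$; $k \ge 1$; and transform $\Phi$. 
1: Transform the data: $\mathbf{z}_n = \Phi(\mathbf{x}_n)$. 
2: Center the data: $\mathbf{z}_n \leftarrow \mathbf{z}_n - \bar{\mathbf{z}}$ where $\bar{\mathbf{z}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{z}_n$. 
3: Obtain the centered data matrix $Z$ whose rows are $\mathbf{z}_n$. 
4: Compute the SVD of $Z$: $[U, \Gamma, V] = \mathrm{svd}(Z)$. 
5: Let $V_k = [\mathbf{v}_1, \ldots, \mathbf{v}_k]$ be the first $k$ columns of $V$. 
6: Construct the top-k PCA-feature matrix $Z_k = Z V_k$. 
7: Use the data $(Z_k, \mathbf{y})$ to learn a final hypothesis $\tilde{g}$. 
8: The final hypothesis is 
$$g(\mathbf{x}) = \tilde{g}\big(V_k^{\mathrm{T}}(\Phi(\mathbf{x}) - \bar{\mathbf{z}})\big).$$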
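
The steps of the algorithm translate almost line for line into NumPy. The sketch below is one possible rendering under stated assumptions: the learner (linear regression used for classification), the 2nd-order transform without the constant coordinate, the function names, and the synthetic data are all hypothetical choices made only to obtain something runnable.

```python
import numpy as np

def fit_nonlinear_pca(X, y, k, phi, learn):
    """Returns the final hypothesis x -> g(V_k^T (phi(x) - z_bar))."""
    Z = np.array([phi(x) for x in X], dtype=float)      # 1: transform the data
    z_bar = Z.mean(axis=0)                              # 2: mean in Z space
    Zc = Z - z_bar                                      # 3: centered data matrix
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)   # 4: SVD (rows of Vt = columns of V)
    Vk = Vt[:k].T                                       # 5: first k columns of V
    Zk = Zc @ Vk                                        # 6: top-k PCA features
    g = learn(Zk, y)                                    # 7: learn on (Z_k, y)
    return lambda x: g(Vk.T @ (np.asarray(phi(x), dtype=float) - z_bar))  # 8

def phi2(x):
    """2nd-order polynomial transform (constant coordinate omitted)."""
    x1, x2 = x
    return [x1, x2, x1 * x1, x1 * x2, x2 * x2]

def learn_linear(Zk, y):
    """Stand-in learner: least-squares fit with a bias term, thresholded at zero."""
    A = np.column_stack([np.ones(len(Zk)), Zk])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda z: 1.0 if w[0] + w[1:] @ z >= 0 else -1.0

# Hypothetical data: x1 varies much more than x2; +1 inside a circle of radius 1.5.
rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(-3, 3, size=400), rng.uniform(-0.3, 0.3, size=400)])
y = np.where(np.sum(X ** 2, axis=1) < 2.25, 1.0, -1.0)

g = fit_nonlinear_pca(X, y, k=2, phi=phi2, learn=learn_linear)
train_acc = float(np.mean([g(x) == t for x, t in zip(X, y)]))
print(train_acc)
```

Note that a test point goes through exactly the same transform, centering vector and projection that were computed from the training inputs; nothing is recomputed at test time.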





There are other approaches to nonlinear dimension reduction. Kernel-PCA is 
a way to combine PCA with the nonlinear feature transform without visiting 
the $\mathcal{Z}$ space. Kernel-PCA uses a kernel in the $\mathcal{X}$ space in much the same 
way that kernels were used to combine the maximum-margin linear separator 
with the nonlinear transform to get the Kernel-Support Vector Machine. Two 
other popular approaches are the Neural-Network auto-encoder (which we dis- 
cussed in the context of Deep Learning with Neural Networks in Section 7.6) 
and nonlinear principal curves and surfaces which are nonlinear analogues to 
the PCA-generated linear principal surfaces. With respect to reconstructing 
the data onto lower dimensional manifolds, there are also nonparametric tech- 
niques like the Laplacian Eigenmap or the Locally Linear Embedding (LLE). 
The problems explore some of these techniques in greater depth. 


9.3 Hints and Invariances 


Hints are tidbits of information about the target function that you know ahead 
of time (before you look at any data). You know these tidbits because you 
know something about the learning problem. Why do we need hints? If we had 
an unlimited supply of data (input-output examples), we wouldn’t need hints 
because the data contains all the information we need to learn.⁵ In reality, 
though, the data is finite, and so not all properties about $f$ may be represented 
in the data. Common hints are invariances, monotonicities and symmetries. 
For example, rotational invariance applies in most vision problems. 


Exercise 9.12 

Consider the following three hypothesis sets, 

$$\mathcal{H}_+ = \{h \mid h(x) = \mathrm{sign}(wx + w_0);\ w > 0\},$$
$$\mathcal{H}_- = \{h \mid h(x) = \mathrm{sign}(wx + w_0);\ w < 0\},$$

and $\mathcal{H} = \mathcal{H}_+ \cup \mathcal{H}_-$. The task is to perform credit approval based on the 
income $x$. The weights $w$ and $w_0$ must be learned from data. 


(a) What are the VC dimensions of $\mathcal{H}$ and $\mathcal{H}_\pm$? 
(b) Which model will you pick (before looking at data)? Explain why. 





The exercise illustrates several important points about learning from data, and 
we are going to beat these points into the ground. First, the choice between $\mathcal{H}$ 
and $\mathcal{H}_\pm$ brings the issue of generalization to light. Do you have enough data to 
use the more complex model, or do you have to use a simpler model to ensure 
that $E_{\text{in}} \approx E_{\text{out}}$? When learning from data, this is the first order of business. 
If you cannot expect $E_{\text{in}} \approx E_{\text{out}}$, you are doomed from step 1. Suppose you 
need to choose one of the simpler models to get good generalization. $\mathcal{H}_+$ and 
$\mathcal{H}_-$ are of equal ‘simplicity’. As far as generalization is concerned, both models 
are equivalent. Let’s play through both choices. 

Suppose you choose $\mathcal{H}_-$. The nature of credit approval is that if you have 
more money, you are less likely to default. So, when you look at the data, it 
will look something like this (blue circles are +1) 


×  ×  ×  ×  ×  ○  ○  ○  ○  ○  ○ 


No matter how you try to fit this data using hypotheses in $\mathcal{H}_-$, the in-sample 
error will be approximately $\frac{1}{2}$. Nevertheless, you will output one such final 
hypothesis $g_-$. You will have succeeded in that $E_{\text{in}}(g_-) \approx E_{\text{out}}(g_-)$. But you 
will have failed in that you will tell your client to expect about 50% 
out-of-sample error, and he will laugh at you. 

Suppose, instead, you choose $\mathcal{H}_+$. Now you will be able to select a 
hypothesis $g_+$ with $E_{\text{in}}(g_+) \approx 0$; and, because of good generalization, you will 
be able to tell the client that you expect near perfect out-of-sample accuracy. 
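
The two outcomes are easy to reproduce numerically. The sketch below uses made-up incomes and labels (not data from the text) and searches all thresholds, since every hypothesis sign(wx + w0) with w ≠ 0 is a threshold rule at t = -w0/w whose orientation is fixed by the sign of w. The function name is hypothetical.

```python
def best_in_sample_error(xs, ys, direction):
    """Minimum E_in over h(x) = sign(w*x + w0) with sign(w) = direction (+1 or -1).

    For w > 0 the hypothesis outputs +1 exactly when x > t = -w0/w;
    for w < 0 it outputs +1 exactly when x < t.
    """
    pts = sorted(xs)
    # Candidate thresholds: below all points, between neighbors, above all points.
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    best = 1.0
    for t in thresholds:
        errors = sum(1 for x, y in zip(xs, ys) if (direction if x > t else -direction) != y)
        best = min(best, errors / len(xs))
    return best

# Hypothetical incomes (in $K): low incomes defaulted (-1), high incomes paid (+1).
incomes = [12, 18, 25, 31, 38, 45, 52, 60, 71, 85]
labels = [-1, -1, -1, -1, -1, +1, +1, +1, +1, +1]

e_plus = best_in_sample_error(incomes, labels, +1)    # H+: w > 0
e_minus = best_in_sample_error(incomes, labels, -1)   # H-: w < 0
print(e_plus, e_minus)
```

Both searches are over hypothesis sets of the same size, so the generalization guarantee is the same; only the choice that agrees with the known monotonicity can drive the in-sample error down.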

As in this credit example, you often have auxiliary knowledge about the 
problem. In our example, we have reason to pick $\mathcal{H}_+$ over $\mathcal{H}_-$, and by doing 
so, we are using the hint that one’s credit status, by its very nature, is 
monotonically increasing in income. 


⁵This is true from the informational point of view. However, from the computational 
point of view, a hint can still help. For example, knowing that the target is linear will 
considerably speed up your search by focusing it on linear models. 


Moral. Don’t throw any old $\mathcal{H}$ at the problem just because your 
$\mathcal{H}$ is small and will give good generalization. More often than not, 
you will learn that you failed. Throw the right $\mathcal{H}$ at the problem, 
and if, unfortunately, you fail you will know it. 





Exercise 9.13 
For the problem in Exercise 9.12, why not try a hybrid procedure: 
Start with H+. If H+ gives low Ein, stop; if not, use H_. 


If your hybrid strategy stopped at H+, can you use the VC bound for H+? 
Can you use the VC-bound for H? 


(Problem 9.20 develops a VC-type analysis of such hybrid strategies within 
a framework called Structural Risk Minimization (SRM).) 


Hints are pieces of prior information about the learning problem. By prior, we 
mean prior to looking at the data. Here are some common examples of hints. 


Symmetry or Anti-symmetry hints: $f(\mathbf{x}) = f(-\mathbf{x})$ or $f(\mathbf{x}) = -f(-\mathbf{x})$. For 
example, in a financial application, if a historical pattern $\mathbf{x}$ means you should 
buy the stock, then the reverse pattern $-\mathbf{x}$ often means you should sell. 


Rotational invariance: $f(\mathbf{x})$ depends only on $\|\mathbf{x}\|$. 


General Invariance hints: For some transformation $T$, $f(\mathbf{x}) = f(T\mathbf{x})$. 
Invariance to scale, shift and rotation of an image are common in vision applications. 


Monotonicity hints: $f(\mathbf{x} + \Delta\mathbf{x}) \ge f(\mathbf{x})$ if $\Delta\mathbf{x} \ge \mathbf{0}$. Sometimes you may only 
have monotonicity in some of the variables. For example, credit approval 
should be a monotonic function of income, but perhaps not monotonic in age. 


Convexity hint: $f(\eta\mathbf{x} + (1 - \eta)\mathbf{x}') \le \eta f(\mathbf{x}) + (1 - \eta) f(\mathbf{x}')$ for $0 \le \eta \le 1$. 


Perturbation hint: $f$ is close to a known function $f'$, so $f = f' + \delta f$, where $\delta f$ 
is small (sometimes called a catalyst hint). 


One way to use a hint is to constrain the hypothesis set so that all 
hypotheses satisfy the hint: start with a hypothesis set $\mathcal{H}$ and throw out all 
hypotheses which do not satisfy the constraint to obtain a smaller hypothesis set. 
This will directly lower the size of the hypothesis set, improving your 
generalization ability. The next exercise illustrates the mechanics of this approach. 


Exercise 9.14 


Consider the linear model with the quadratic transform in two dimensions. 
So, the hypothesis set contains functions of the form 

$$h(\mathbf{x}) = \mathrm{sign}(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2).$$

Determine constraints on $\mathbf{w}$ so that: 
(a) $h(\mathbf{x})$ is monotonically increasing in $\mathbf{x}$. 
(b) $h(\mathbf{x})$ is invariant under an arbitrary rotation of $\mathbf{x}$. 
(c) The positive set $\{\mathbf{x} : h(\mathbf{x}) = +1\}$ is convex. 


Since the hint is a known property of the target, throwing out hypotheses that 
do not match the hint should not diminish your ability to fit the target function 
(or the data). Unfortunately, this need not be the case if the target function 
was not in your hypothesis set to start with. The simple example below 
explains why. The figure shows a hypothesis set with three hypotheses (blue) 
and the target function (red) which is known to be monotonically increasing. 





$\mathcal{H} = \{h^*, h_1, h_2\}$ 














Though f is monotonic, the best approximation to f in H is h*, and h* is not 
monotonic. So, if you remove all non-monotonic hypotheses from H, all the 
remaining hypotheses are bad and you will underfit heavily. 


The best approximation to f in H may not satisfy your hint. 


When you outright enforce the hint, you certainly ensure that the final hypoth- 
esis has the property that f is known to have. But you also may exaggerate 
deficiencies that were in the hypothesis set to start with. Strictly enforcing 
a hint is like a hard constraint, and we have seen an example of this before 
when we studied regularization and the hard-order constraint in Chapter 4: 
we ‘softened’ the constraint, thereby giving the algorithm some flexibility to fit 
the data while still pointing the learning toward simpler hypotheses. A similar 
situation holds for hints. It is often better to inform the learning about the 
hint but allow the flexibility to violate the hint a little if that is warranted for 
fitting the data - that is, to softly enforce the hint. A popular and general 
way to softly enforce a hint is using virtual examples. 


9.3.1 Virtual Examples 


Hints are useful because they convey additional information that may not be 
represented in the finite data set. Virtual examples augment the data set so 
that the hint becomes represented in the data. The learning can then just focus 
on this augmented data without needing any new machinery to deal explicitly 
with the hint. The learning algorithm, by trying to fit all the data (the real 
and virtual examples) can trade-off satisfying the known hint about f (by 
attempting to fit the virtual examples) with approximating f (by attempting 
to fit the real examples). In contrast, when you explicitly constrain the model, 
you cannot take advantage of this trade-off; you simply do the best you can 
to fit the data while obeying the hint. If by slightly violating the hint you can 
get a hypothesis that is a supreme fit to the data, it’s just too bad, because 
you don’t have access to such a hypothesis. Let’s set aside the ideological goal 
that known properties of f must be preserved and focus on the error (both in 
and out-of-sample); that is all that matters. 

The general approach to constructing virtual examples is to imagine nu- 
merically testing if a hypothesis h satisfies the known hint. This will suggest 
how to build an error function that would equal zero if the hint is satisfied 
and non-zero values will quantify the degree to which the hint is not satisfied. 
Let’s see how to do this by concrete example. 


Invariance Hint. Let’s begin with a simple example, the symmetry hint 
which says that the target function is symmetric: $f(\mathbf{x}) = f(-\mathbf{x})$. How would 
we test that a particular $h$ has symmetry? Generate a set of arbitrary virtual 
pairs $\{\mathbf{v}_m, \mathbf{v}_m'\}$ with $\mathbf{v}_m' = -\mathbf{v}_m$, $m = 1, \ldots, M$ (v for virtual), and test if 
$h(\mathbf{v}_m) = h(\mathbf{v}_m')$ for every virtual pair. Thus, we can define the hint error 

$$E_{\text{hint}}(h) = \frac{1}{M} \sum_{m=1}^{M} \big(h(\mathbf{v}_m) - h(\mathbf{v}_m')\big)^2.$$

Certainly, for the target function, $E_{\text{hint}}(f) = 0$. By allowing for non-zero 
$E_{\text{hint}}$, we can allow the possibility of slightly violating symmetry, provided 
that it gives us a much better fit of the data. So, define an augmented error 

$$E_{\text{aug}}(h) = E_{\text{in}}(h) + \lambda E_{\text{hint}}(h) \qquad (9.3)$$

to quantify the trade-off between enforcing the hint and fitting the data. By 
minimizing $E_{\text{aug}}$, we are not explicitly enforcing the hint, but encouraging it. 
The parameter $\lambda$ controls how much we emphasize the hint over the fit. This 
looks suspiciously like regularization, and we will say more on this later. 
How should one choose $\lambda$? The hint conveys information to help the 
learning. However, overemphasizing the hint by picking too large a value for $\lambda$ can 
lead to underfitting just as over-regularizing can. If $\lambda$ is too large, it amounts 
to strictly enforcing the hint, and we already saw how that can hurt. The 
answer: use validation to select $\lambda$. (See Chapter 4 for details on validation.) 
How should one select virtual examples $\{\mathbf{v}_m, -\mathbf{v}_m\}$ to compute the hint 
error? As a start, use the real examples themselves. A virtual example for the 
symmetry hint would be $\{\mathbf{x}_n, -\mathbf{x}_n\}$, for $n = 1, \ldots, N$. 
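
For a model that is linear in its weights and uses squared error, minimizing the augmented error with the data points as virtual examples reduces to one least-squares problem. The sketch below assumes a hypothetical quadratic regression model and synthetic one-sided data; the function names are not from the text, and in practice $\lambda$ would be chosen by validation rather than fixed as it is here.

```python
import numpy as np

def features(x):
    """Hypothetical model: h(x) = w0 + w1*x + w2*x^2."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2])

def fit_with_symmetry_hint(x, y, lam):
    """Minimize E_aug = E_in + lam * E_hint with virtual pairs {x_n, -x_n}."""
    A_data = features(x)                      # rows give h(x_n)
    A_hint = features(x) - features(-x)       # rows give h(x_n) - h(-x_n)
    # Both errors are means of N squares, so stacking the hint rows with
    # weight sqrt(lam) makes N * E_aug a single least-squares objective.
    A = np.vstack([A_data, np.sqrt(lam) * A_hint])
    b = np.concatenate([y, np.zeros(len(x))])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def hint_error(w, x):
    """E_hint(h) on the pairs {x_n, -x_n}."""
    return float(np.mean((features(x) @ w - features(-x) @ w) ** 2))

# Hypothetical symmetric target f(x) = x^2, seen with noise at a few one-sided inputs.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=8)
y = x ** 2 + 0.1 * rng.normal(size=8)

w_plain = fit_with_symmetry_hint(x, y, lam=0.0)
w_hint = fit_with_symmetry_hint(x, y, lam=10.0)
print(hint_error(w_plain, x), hint_error(w_hint, x))
```

Because all the real inputs are positive, the data alone say nothing about negative inputs; the virtual examples are what push the odd coefficient toward zero.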


Exercise 9.15 
Why might it be a good idea to use the data points to compute the hint 
error? When might it be better to use more hint examples? 


[Hint (☺): Do you care if a hypothesis violates the hint in a part of the 
input space that has very low probability?] 


A general invariance partitions the input space into disjoint regions, 

$$\mathcal{X} = \bigcup_{\alpha} \mathcal{X}_\alpha.$$

Within any partition, the target function is constant. That is, if $\mathbf{x}, \mathbf{x}' \in \mathcal{X}_\alpha$ 
then $f(\mathbf{x}) = f(\mathbf{x}')$. A set of transformations $\mathcal{T}$ can be used to implicitly define 
the regions in the partition: $\mathbf{x}$ and $\mathbf{x}'$ are in the same partition $\mathcal{X}_\alpha$ if and only 
if $\mathbf{x}'$ can be obtained from $\mathbf{x}$ by applying a composition of transformations 
from $\mathcal{T}$. As an example, suppose $f(\mathbf{x})$ is invariant under arbitrary rotation of 
the input $\mathbf{x}$ in the plane. Then the invariant set $\mathcal{X}_r$ is the circle of radius $r$ 
centered on the origin. For any two points on the same circle, the target 
function evaluates to the same value. To generate the virtual examples, take 
each data point $\mathbf{x}_n$ and determine (for example randomly) a virtual example $\mathbf{x}_n'$ 
that is in the same invariant set as $\mathbf{x}_n$. The hint error would then be 

$$E_{\text{hint}}(h) = \frac{1}{N} \sum_{n=1}^{N} \big(h(\mathbf{x}_n) - h(\mathbf{x}_n')\big)^2.$$

Notice that we can always compute the hint error, even though we do not 
know the target function on the virtual examples. 


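Monotonicity Hint. To test if $h$ is monotonic, for each data point $\mathbf{x}_n$ 
we construct $\mathbf{x}_n' = \mathbf{x}_n + \Delta\mathbf{x}_n$ with $\Delta\mathbf{x}_n \ge \mathbf{0}$. Monotonicity implies that 
$h(\mathbf{x}_n') \ge h(\mathbf{x}_n)$. We can thus construct a hint error of the form 

$$E_{\text{hint}}(h) = \frac{1}{N} \sum_{n=1}^{N} \big(h(\mathbf{x}_n) - h(\mathbf{x}_n')\big)^2 \, [\![ h(\mathbf{x}_n') < h(\mathbf{x}_n) ]\!].$$

Each term in the hint error penalizes violation of the hint, and is set to be zero 
if the monotonicity is not violated at $\mathbf{x}_n$, regardless of the value of 
$(h(\mathbf{x}_n) - h(\mathbf{x}_n'))^2$. 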
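
A direct transcription of the monotonicity hint error for one-dimensional inputs is shown below as a small sketch. The two example hypotheses, the inputs and the step sizes are made up for illustration; the function name is hypothetical.

```python
def monotonicity_hint_error(h, xs, deltas):
    """E_hint(h) = (1/N) * sum of (h(x_n) - h(x_n'))^2 over the pairs
    where the hint is violated, i.e. where h(x_n') < h(x_n)."""
    total = 0.0
    for x, dx in zip(xs, deltas):
        if dx < 0:
            raise ValueError("virtual examples must move in the increasing direction")
        x_virtual = x + dx
        if h(x_virtual) < h(x):              # violation at x: penalize the drop
            total += (h(x) - h(x_virtual)) ** 2
    return total / len(xs)

def increasing(x):
    return 2.0 * x + 1.0

def bump(x):
    return -(x - 1.5) ** 2                   # rises up to x = 1.5, then falls

xs = [0.0, 1.0, 2.0, 3.0]
deltas = [0.5, 0.5, 0.5, 0.5]

e_increasing = monotonicity_hint_error(increasing, xs, deltas)
e_bump = monotonicity_hint_error(bump, xs, deltas)
print(e_increasing, e_bump)   # 0.0 and 0.90625
```

Only the last two points contribute for the second hypothesis: drops of 0.75 and 1.75 give (0.5625 + 3.0625) / 4 = 0.90625, while the pairs on the rising side contribute nothing no matter how large the increase is.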


Exercise 9.16 


Give hint errors for rotational invariance, convexity and perturbation hints. 


9.3.2 Hints Versus Regularization 


Regularization (Chapter 4) boiled down to minimizing an augmented error 

$$E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda}{N} \Omega(h),$$

where the regularizer $\Omega(h)$ penalized the ‘complexity’ of $h$. Implementing a 
hint by minimizing an augmented error as in (9.3) looks like regularization 
with regularizer $E_{\text{hint}}(h)$. Indeed the two are similar, but we would like to 
highlight the difference. Regularization directly combats the effects of noise 
(stochastic and deterministic) by encouraging a hypothesis to be simpler: the 
primary goal is to reduce the variance, and the price is usually a small increase 
in bias. A hint helps us choose a small hypothesis set that is likely to contain 
the target function (see Exercise 9.12): the primary motivation is to reduce 
the bias. By implementing the hint using a penalty term $E_{\text{hint}}(h)$, we will be 
reducing the variance but there will usually be no price to pay in terms of 
higher bias. Indirectly, the effect is to choose the right smaller hypothesis set. 


Regularization fights noise by pointing the learning toward simpler hypothe- 
ses; this applies to any target function. Hints fight bias by helping us choose 
wisely from among small hypothesis sets; a hint can hurt if it presents an 
incorrect bit of information about the target. 


As you might have guessed, you can use both tools and minimize 

$$E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda_1}{N} \Omega(h) + \lambda_2 E_{\text{hint}}(h),$$


thereby combating noise while at the same time incorporating information 
about the target function. It is possible for the two tools to send the learning in 
different directions, but this is rare (for example a hint that the target is a 10th 
order polynomial may get overridden by the regularizer which tells you there 
is not enough data to learn a 10th order polynomial). It generally pays huge 
dividends to incorporate properties of the target function, and regularization 
is a must because there is always noise. 


9.4 Data Cleaning 


Although having more data is often a good idea, sometimes less is more. 
Data cleaning attempts to identify and remove noisy or hard examples. A 
noisy ‘outlier’ example can seriously lead learning astray (stochastic noise). 
A complex data point can be just as bad (deterministic noise). To teach a 
two-year-old mathematics, you may start with induction or counting 1, 2, 3. 
Both are correct ‘data points’, but the complex concepts of induction will only 
confuse the child (a simple learner). 

Why remove the noisy data as opposed to simply incurring in-sample error 
on them? You don’t just incur in-sample error on the noisy data. You incur 
additional, unwanted in-sample error on the non-noisy data when the learning 
is led astray. When you remove the noisy data, you have fewer data points to 
learn from, but you will gain because these fewer data points have the useful 
information. The following regression example helps to illustrate. 








[Figure, vertical axis y in both panels: (a) Fit led astray by outlier; (b) Fit without outlier.] 


When we retain the outlier point (red), the linear fit is distorted (even the 
sign of the slope changes). The principle is not hard to buy into: throw away 
detrimental data points. The challenge is to identify these detrimental data 
points. We discuss two approaches that are simple and effective. 


9.4.1 Using a Simpler Model to Identify Noisy Data 


The learning algorithm sees everything through the lens of the hypothesis set. 
If your hypothesis set is complex, then you will be able to fit a complex data 
set, so no data point will look like noise to you. On the other hand, if your 
hypothesis set is simple, many of the data points could look like noise. We 
can identify the hard examples by viewing the data through a slightly simpler 
model than the one you will finally use. One typical choice of the simpler 
model is a linear model. Data that the simpler model cannot classify are the 
hard examples, to be disregarded in learning with the complex model. It is a 
bit of an art to decide on a model that is simple enough to identify the noisy 
data, but not so simple that you throw away good data which are useful for 
learning with the more complex final model. 

Rather than relying solely on a single simpler hypothesis to identify the 
noisy data points, it is generally better to have several instances of the sim- 
pler hypothesis. If a data point is misclassified by a majority of the simpler 
hypotheses, then there is more evidence that the data point is hard. One easy 
way to generate several such simpler hypotheses is to use the same hypothesis 
set (e.g. linear models) trained on different data sets generated by sampling 
data points from the original data with replacement (this is called Bootstrap- 
sampling from the original data set). Example 9.3 illustrates this technique 
with the digits data from Example 3.1. 


Example 9.3. We randomly selected 500 examples of digits data $\mathcal{D}$, and 
generated more than 10,000 data sets of size 500. To generate a data set, 
we sampled data points from $\mathcal{D}$ with replacement (Bootstrap sampling). Each 
bootstrapped data set is used to construct a ‘simple’ hypothesis using the 
k-Nearest Neighbor rule for a large k. Each simple hypothesis misclassifies some 
of the data that was used to obtain that hypothesis. If a particular data point 
is misclassified by more than half the hypotheses which that data point 
influenced, we identify that data point as bad (stochastic or deterministic noise). 
Figure 9.5 shows the bad data for two different choices of the ‘simple’ model: 
(a) 5-NN; and (b) 101-NN. The simpler model (101-NN) identifies more 
confusing data points, as expected. The results depend on our choice of k in the 
k-NN algorithm. If k is large (generating a constant final hypothesis), then all 
the ‘+1’-points will be identified as confusing, because the ‘−1’-points are in 
the majority. In Figure 9.5, the points identified as confusing are intuitively 
so: solitary examples of one class inside a bastion of the other class. 


(a) 5-Nearest Neighbor (b) 101-Nearest Neighbor 


Figure 9.5: Identifying noisy data using the ‘simple’ nearest neighbor rule. 
The data in black boxes are identified as noisy (a) using 5-Nearest Neighbor 
and (b) using 101-Nearest Neighbor. 101-Nearest Neighbor, the simpler of 
the two models, identifies more noisy data as expected. 

If we choose, for our final classifier, the ‘complex’ 1-Nearest Neighbor rule, 
Figure 9.6(a) shows the classifier using all the data (a reproduction of Fig- 
ure 6.2(a) in Chapter 6). A quick look at the decision boundary for the same 
1-Nearest Neighbor rule after removing the bad data (see Figure 9.6(b)) visu- 
ally confirms that the overfitting is reduced. What matters is the test error, 
which dropped from 1.7% to 1.1%, a one-third reduction in test error rate. 

If computational resources permit, a refinement of this simple and effective 
method is to remove the noisy points sequentially: first remove the most 
confusing data point; now rerun the whole process to remove the next data 
point and so on. The advantage of this sequential process is that a slightly 
confusing example may be redeemed after the true culprit is removed. 


(a) All data, $E_{\text{test}}$ = 1.7% (b) Cleaned data, $E_{\text{test}}$ = 1.1% 
[axis label in both panels: intensity] 


Figure 9.6: The 1-Nearest Neighbor classifier (a) using all the data and (b) 
after removing the bad examples as determined using 101-Nearest Neighbor. 
Removing the bad data gives a more believable decision boundary. 

In our example, 101-Nearest Neighbor was the ‘simple’ model, and an 
example was bad if more than half the simple hypotheses misclassified that example. 
The simpler the model, the more outliers it will find. A lower threshold on 
how often an example is misclassified also results in more outliers. These are 
implementation choices, and we’re in the land of heuristics here. In practice, 
a slightly simpler model than your final hypothesis and a threshold of 50% for 
the number of times an example is misclassified are reasonable choices. 
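
The bootstrap procedure can be sketched compactly. This is an illustration under stated assumptions rather than the experiment of Example 9.3: it uses a small synthetic two-cluster data set with one deliberately flipped label, 5-NN as the ‘simple’ model, 200 bootstrap samples, and the 50% threshold. The function name and all parameter values are hypothetical.

```python
import numpy as np

def flag_noisy_points(X, y, k=5, n_boot=200, threshold=0.5, seed=0):
    """Flag points misclassified by more than `threshold` of the bootstrapped
    k-NN hypotheses whose training sample contained them."""
    rng = np.random.default_rng(seed)
    N = len(X)
    used = np.zeros(N)       # number of bootstrap samples containing each point
    missed = np.zeros(N)     # number of those hypotheses that misclassify it
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)          # sample N points with replacement
        Xb, yb = X[idx], y[idx]
        members = np.unique(idx)                  # points that influenced this hypothesis
        # Squared distances from each member point to every bootstrap point.
        d2 = ((X[members, None, :] - Xb[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d2, axis=1)[:, :k]   # k nearest, copies of itself included
        votes = np.sign(yb[nearest].sum(axis=1))  # majority vote (odd k avoids ties)
        used[members] += 1
        missed[members] += votes != y[members]
    return missed > threshold * np.maximum(used, 1)

# Hypothetical data: two well-separated clusters, one label flipped.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 0.5, size=(40, 2)), rng.normal(+2, 0.5, size=(40, 2))])
y = np.concatenate([-np.ones(40), np.ones(40)])
y[0] = +1                                         # mislabeled point inside the -1 cluster
noisy = flag_noisy_points(X, y)
print(np.flatnonzero(noisy))
```

Raising k or lowering the threshold would flag more points, which mirrors the trade-off described above; both remain heuristic choices.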














9.4.2 Computing a Validation Leverage Score 


If, after removing an example from your data set, you can improve your test 
error, then by all means remove the example. Easier said than done — how do 
you know whether the test error will improve when you remove a particular 
data point? The answer is to use validation to estimate the adverse impact 
(or leverage) that a data point has on the final hypothesis g. If an example 
has large leverage with negative impact, then it is a very ‘risky’ example and 
should be disregarded during learning. 

Let’s first define leverage and then see how to estimate it. Recall that 
using data set $\mathcal{D}$, the learning produces final hypothesis $g$. When we discussed 
validation in Chapter 4, we introduced the data set $\mathcal{D}_n$ which contained all 
the data in $\mathcal{D}$ except $(\mathbf{x}_n, y_n)$. We denoted by $g_n^-$ the final hypothesis that you 
get from $\mathcal{D}_n$. We denote the leverage score of data point $(\mathbf{x}_n, y_n)$ by 

$$\ell_n = E_{\text{out}}(g) - E_{\text{out}}(g_n^-).$$

The leverage score measures how valuable $(\mathbf{x}_n, y_n)$ is. If $\ell_n$ is large and positive, 
$(\mathbf{x}_n, y_n)$ is detrimental and should be discarded. To estimate $\ell_n$, we need a 
validation method to estimate $E_{\text{out}}$. Any method for validation can be used. 
In Chapter 4 we introduced $E_{\text{cv}}$, the cross-validation estimate of $E_{\text{out}}(g)$; $E_{\text{cv}}$ 
is an unbiased estimate of the out-of-sample performance when learning from 
$N - 1$ data points. The algorithm to compute $E_{\text{cv}}$ takes as input a data set 
$\mathcal{D}$ and outputs $E_{\text{cv}}(\mathcal{D})$, an estimate of the out-of-sample performance for the 
final hypothesis learned from $\mathcal{D}$. Since we get $g_n^-$ from $\mathcal{D}_n$, if we run the cross 
validation algorithm on $\mathcal{D}_n$, we will get an estimate of $E_{\text{out}}(g_n^-)$. Thus, 

$$\ell_n \approx E_{\text{cv}}(\mathcal{D}) - E_{\text{cv}}(\mathcal{D}_n). \qquad (9.4)$$

You can replace the $E_{\text{cv}}$ algorithm above with any validation algorithm, 
provided you run it once on $\mathcal{D}$ and once on $\mathcal{D}_n$. The computation of $E_{\text{cv}}(\mathcal{D})$ only 
needs to be performed once. Meanwhile, we are asking whether each data 
point in turn is helping. If we are using cross-validation to determine whether 
a data point is helping, $E_{\text{cv}}(\mathcal{D}_n)$ needs to be computed for $n = 1, \ldots, N$, 
requiring you to learn $N(N - 1)$ times on data sets of size $N - 2$. That is a 
formidable feat. For linear regression, this can be done more efficiently (see 
Problem 9.21). Example 9.4 illustrates the use of validation to compute the 
leverage for our toy regression example that appeared earlier. 
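
The brute-force computation in (9.4) can be sketched for a straight-line fit as follows. The seven data points are made up (they are not the points of Figure 9.7), the function names are hypothetical, and the code deliberately uses the naive leave-one-out loop rather than the efficient method of Problem 9.21.

```python
import numpy as np

def e_cv(x, y):
    """Leave-one-out cross-validation error of a straight-line fit (squared error)."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
        errs.append((slope * x[i] + intercept - y[i]) ** 2)
    return float(np.mean(errs))

def leverage_scores(x, y):
    """l_n ~ E_cv(D) - E_cv(D_n), where D_n is D without (x_n, y_n)."""
    full = e_cv(x, y)                          # computed once
    scores = []
    for n in range(len(x)):
        keep = np.arange(len(x)) != n
        scores.append(full - e_cv(x[keep], y[keep]))   # N - 1 fits on N - 2 points each
    return np.array(scores)

# Hypothetical seven-point data set: six points near y = x, plus one outlier.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.1, 4.9, 0.0])     # the last point is the outlier
scores = leverage_scores(x, y)
print(np.round(scores, 2))
```

The nested loops make the cost explicit: the inner call fits N - 1 lines for each of the N removed points, which is the N(N - 1) count given above.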


Example 9.4. We use a linear model to fit the data shown in Figure 9.7(a). 
There are seven data points, so we need to compute 

$$E_{\text{cv}}(\mathcal{D}) \quad \text{and} \quad E_{\text{cv}}(\mathcal{D}_1), \ldots, E_{\text{cv}}(\mathcal{D}_7).$$

We show the leverage $\ell_n \approx E_{\text{cv}}(\mathcal{D}) - E_{\text{cv}}(\mathcal{D}_n)$ for $n = 1, \ldots, 7$ in Figure 9.7(b). 


(a) Data for linear regression (b) Leverage estimates 


Figure 9.7: Estimating the leverage of data points using cross validation. 
(a) The data points with one outlier. (b) The estimated leverage. The outlier 
has a large positive leverage indicative of a very noisy point. 


Only the outlier has huge positive leverage, and it should be discarded. 


9.5 More on Validation 


We introduced validation in Chapter 4 as a means to select a good regulariza- 
tion parameter or a good model. We saw in the previous section how validation 
can be used to identify noisy data points that are detrimental to learning. Val- 
idation gives an in-sample estimate for the out-of-sample performance. It lets 
us certify that the final hypothesis is good, and is such an important concept 
that many techniques have been developed. Time spent augmenting our tools 
for validation is well worth the effort, so we end this chapter with a few ad- 
vanced techniques for validation. Don’t fear. The methods are advanced, but 
the algorithms are simple. 


9.5.1 Rademacher Penalties 


The Rademacher penalty estimates the optimistic bias of the in-sample error. 

$$E_{\text{out}}(g) = E_{\text{in}}(g) + \underbrace{\text{overfit penalty}}_{\text{Rademacher penalty estimates this}}.$$

One way to view the overfit penalty is that it represents the optimism in the 
in-sample error. You reduced the error all the way to $E_{\text{in}}(g)$. Part of that 
was real, and the rest was just you fooling yourself. The Rademacher overfit 
penalty attempts to estimate this optimism. 

The intuitive leap is to realize that the optimism in $E_{\text{in}}$ is a result of the 
fitting capability of your hypothesis set $\mathcal{H}$, not because, by accident, your data 
set was easy to fit. This suggests that if we can compute the optimism penalty 
for some data set, then we can apply it to other data sets, for example, to the 
data set $\mathcal{D} = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ that we are given. So let’s try to find a data 
set for which we can compute the optimism. And indeed there is one such data 
set, namely a random one. Consider the data set 

$$\mathcal{D}' = (\mathbf{x}_1, r_1), \ldots, (\mathbf{x}_N, r_N),$$

where $r_1, \ldots, r_N$ are generated independently by a random target function, 
$\mathbb{P}[r_n = +1] = \frac{1}{2}$. The $r_n$ are called Rademacher variables. The inputs are the 
same as the original data set, so in that sense $\mathcal{D}'$ mimics $\mathcal{D}$. After learning 
on $\mathcal{D}'$ you produce hypothesis $g_r$ with in-sample error $E'_{\text{in}}(g_r)$ (we use the 
prime $(\cdot)'$ to denote quantities with respect to the random problem). Clearly, 
$E'_{\text{out}}(g_r) = \frac{1}{2}$, because the target function is random, so the optimism is 
$\frac{1}{2} - E'_{\text{in}}(g_r)$. Using this measure of optimism for our actual problem, we get the 
Rademacher estimate 

$$\hat{E}_{\text{out}}(g) = E_{\text{in}}(g) + \left(\tfrac{1}{2} - E'_{\text{in}}(g_r)\right).$$


The Rademacher penalty is easy to compute: generate random targets for your data; perform an in-sample minimization to see how far below $\frac{1}{2}$ you can get the error, and use that as a penalty on your actual in-sample error.

Exercise 9.17


After you perform in-sample minimization to get $g_r$, show that the Rademacher penalty for $E_{\text{in}}(g)$ is given by

$$\frac{1}{2} - E'_{\text{in}}(g_r) = \max_{h \in \mathcal{H}} \left\{ \frac{1}{2N} \sum_{n=1}^{N} r_n h(\mathbf{x}_n) \right\},$$

which is proportional to the maximum correlation you can obtain with random signs using hypotheses in the data set.


A better way to estimate the optimism penalty is to compute the expectation over different realizations of the Rademacher variables: $\mathbb{E}_r[\frac{1}{2} - E'_{\text{in}}(g_r)]$. In practice, one generates several random Rademacher data sets and takes the average optimism penalty. This means you have to do an in-sample error minimization for each Rademacher data set. The good news is that, with high probability, just a single Rademacher data set suffices.
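The recipe above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the hypothesis set is taken to be one-dimensional thresholds $h(x) = s \cdot \text{sign}(x - w_0)$ with $s = \pm 1$ (chosen because exact in-sample minimization is easy), the inputs are assumed distinct, and the function names `min_in_sample_error` and `rademacher_penalty` are hypothetical.

```python
import random


def min_in_sample_error(xs, targets):
    """Smallest fraction of errors over h(x) = s * sign(x - w0), s in {+1, -1}.

    Assumes the inputs xs are distinct.
    """
    n = len(xs)
    pairs = sorted(zip(xs, targets))
    # Start with w0 below every input, so sign(x - w0) = +1 on all points.
    errors = sum(1 for _, t in pairs if t != 1)
    best = min(errors, n - errors)  # n - errors is the count for the negated rule
    # Slide w0 past one input at a time; that input's prediction flips to -1.
    for _, t in pairs:
        errors += 1 if t == 1 else -1
        best = min(best, errors, n - errors)
    return best / n


def rademacher_penalty(xs, num_sets=1, rng=random):
    """Average of 1/2 - E'_in(g_r) over num_sets draws of random signs."""
    total = 0.0
    for _ in range(num_sets):
        signs = [rng.choice((-1, 1)) for _ in xs]
        total += 0.5 - min_in_sample_error(xs, signs)
    return total / num_sets


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.random() for _ in range(50)]
    print("single data set:", rademacher_penalty(xs, num_sets=1, rng=rng))
    print("average of 100: ", rademacher_penalty(xs, num_sets=100, rng=rng))
```

Setting `num_sets=1` corresponds to the single-realization penalty; a larger value approximates the expectation over the Rademacher variables.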

It is possible to get a theoretical bound for $E_{\text{out}}(g)$ using the Rademacher overfit penalty (just like we did with the VC penalty). The bound is universal (because it holds for any learning scenario, that is, any target function and input probability distribution); but, unlike the VC penalty, it is data dependent, hence it tends to be tighter in practice. (Remember that the VC bound was independent of the data set because it computed $m_{\mathcal{H}}(N)$ on the worst possible data set of size $N$.) It is easy to use the Rademacher penalty, since you just need to do a single in-sample error optimization. In this sense the approach is similar to cross-validation, except the Rademacher penalty is computationally cheaper (recall that cross-validation required $N$ in-sample optimizations).

The Rademacher penalty requires in-sample error minimization: you compute the worst-case optimism. However, it can also be used with regularization and penalized error minimization, because to a good approximation, penalized error minimization is equivalent to in-sample error minimization using a constrained hypothesis set (see Chapter 4). Thus, when we say “in-sample error minimization,” read it as “run the learning algorithm.”














9.5.2 The Permutation and Bootstrap Penalties 


There are other random data sets for which we can compute the optimism. The permutation and Bootstrap estimates are based on these, and are just as easy to compute using an in-sample error minimization.


Permutation Optimism Penalty. Again, we estimate the optimism by considering how much we can overfit random data, but now we choose the y-values to be a random permutation of the actual y-values in the data, as opposed to random signs. Thus, the estimate easily generalizes to regression as well. The learning problem generated by a random permutation mimics the actual learning problem more closely since it takes both the x and y values from the data. Consider a randomly permuted data set,

$$\mathcal{D}_\pi = (\mathbf{x}_1, y_{\pi_1}), \ldots, (\mathbf{x}_N, y_{\pi_N}),$$

where $\pi$ is a random permutation of $(1, \ldots, N)$. After in-sample error minimization on this randomly permuted data, you produce $g_\pi$ with in-sample error on $\mathcal{D}_\pi$ equal to $E^{\pi}_{\text{in}}(g_\pi)$. Unlike with the Rademacher random learning problem, the out-of-sample error for this random learning problem is not simply $\frac{1}{2}$, because the target values are not random signs. We need to be a little more careful and obtain the expected error for the joint $(\mathbf{x}, y)$ target distribution that describes this random learning problem (see Problem 9.12 for the details). However, the final result is the same, namely that the overfit penalty (optimism on the random problem) is proportional to the correlation between the learned function $g_\pi(\mathbf{x}_n)$ and the randomly permuted target values $y_{\pi_n}$.⁶ The permutation estimate of the out-of-sample error is

$$\hat{E}_{\text{out}}(g) = E_{\text{in}}(g) + \frac{1}{2N} \sum_{n=1}^{N} (y_{\pi_n} - \bar{y})\, g_\pi(\mathbf{x}_n), \qquad (9.5)$$

where $\bar{y}$ is the mean target value in the data. The second term is the permutation optimism penalty obtained from a single randomly permuted data set. Ideally, you should average the optimism penalty over several random permutations, even though, as with the Rademacher estimate, one can show that a single permuted data set suffices to give a good estimate with high probability. As already mentioned, the permutation estimate can be applied to regression, in which case it is more traditional not to rescale the sum of squared errors by $\frac{1}{4}$ (which is appropriate for classification). In this case, the $\frac{1}{2N}$ becomes $\frac{2}{N}$.


Bootstrap Optimism Penalty. The Rademacher and permutation data sets occupy two extremes. For the Rademacher data set, we choose the y-values as random signs, independent of the actual values in the data. For the permutation data set, we take the y-values directly from the data and randomly permute them: every target value is represented exactly once, which can be viewed as sampling the target values from the data without replacement. The Bootstrap optimism penalty generates a random set of targets, one for each input, by sampling y-values in the data, but with replacement. Thus, the y-values represent the observed distribution, but not every y-value may be picked. The functional form of the estimate (9.5) is unchanged; one simply replaces all the terms in the overfit penalty with the corresponding terms for the Bootstrapped data set.
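As an illustration of the two resampling schemes, here is a minimal sketch for regression. It assumes the learning algorithm is an ordinary least-squares line fit in one dimension and uses the regression scaling $\frac{2}{N}$ for the penalty; the names `fit_line` and `optimism_penalty` are hypothetical, and the mean in the penalty is taken over the resampled targets (for a permutation this is just the mean of the original targets).

```python
import random


def fit_line(xs, ys):
    """Least-squares line fit; returns the fitted values on the inputs xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return [y_bar + slope * (x - x_bar) for x in xs]


def optimism_penalty(xs, ys, scheme="permutation", num_sets=200, rng=random):
    """Average of (2/N) * sum_n (t_n - mean(t)) * g(x_n) over random target sets."""
    if scheme not in ("permutation", "bootstrap"):
        raise ValueError("scheme must be 'permutation' or 'bootstrap'")
    n = len(ys)
    total = 0.0
    for _ in range(num_sets):
        if scheme == "permutation":
            targets = rng.sample(ys, n)  # sampling without replacement
        else:
            targets = [rng.choice(ys) for _ in range(n)]  # with replacement
        t_bar = sum(targets) / n
        fitted = fit_line(xs, targets)  # "learn" on the random targets
        total += (2.0 / n) * sum((t - t_bar) * g for t, g in zip(targets, fitted))
    return total / num_sets


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [i / 10.0 for i in range(20)]
    ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    print("permutation:", optimism_penalty(xs, ys, "permutation", rng=rng))
    print("bootstrap:  ", optimism_penalty(xs, ys, "bootstrap", rng=rng))
```

The estimate of the out-of-sample error is then the in-sample squared error of the actual fit plus the returned penalty.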


The permutation and Rademacher estimates may be used for model selection just as any other method which constructs a proxy for the out-of-sample error. In some cases, as with leave-one-out cross validation, it is possible to compute the average (over the random data sets) of the overfit penalty analytically, which means that one can use the permutation estimate without having to perform any explicit in-sample error optimizations.

⁶ Any such correlation is spurious, as no input-output relationship exists.


Example 9.5 (Permutation Estimate for Linear Regression with Weight Decay). For linear models, we can compute the permutation estimate without explicitly learning on the randomly permuted data (see Problem 9.18 for details). The estimate for $E_{\text{out}}$ is

$$\hat{E}_{\text{out}}(g) = E_{\text{in}}(g) + \frac{2}{N} \sum_{n=1}^{N} \mathbb{E}_\pi\left[(y_{\pi_n} - \bar{y})\, g^{(\pi)}(\mathbf{x}_n)\right]$$

$$= E_{\text{in}}(g) + \frac{2\hat{\sigma}_y^2}{N} \left(\text{trace}(H) - \frac{\mathbf{1}^T H \mathbf{1}}{N}\right), \qquad (9.6)$$

where $\hat{\sigma}_y^2 = \frac{N}{N-1}\text{var}(y)$ is the unbiased estimate for the variance of the target values and $H = X(X^TX + \lambda I)^{-1}X^T$ is the hat-matrix that depends only on the input data. When there is no regularization, $\lambda = 0$, so $H$ is a projection matrix (projecting onto the columns of $X$), and $\text{trace}(H) = d + 1$. If the first column of $X$ is a column of ones (the constant term in the regression), then $H\mathbf{1} = \mathbf{1}$ and the permutation estimate becomes

$$\hat{E}_{\text{out}}(g) = E_{\text{in}}(g) + \frac{2\hat{\sigma}_y^2 d}{N}.$$

This estimate is similar to the AIC estimate from information theory. It is also reminiscent of the test error in Exercise 3.4 which resulted in

$$E_{\text{out}}(g) = E_{\text{in}}(g) + \frac{2\sigma^2(d+1)}{N},$$

where $\sigma^2$ was the noise variance. The permutation estimate uses $\hat{\sigma}_y^2$ in place of $\sigma^2$. Observe that $\text{trace}(H) - \frac{1}{N}\mathbf{1}^T H \mathbf{1}$ plays the role of an effective dimension, $d_{\text{eff}}$. Problem 4.13 in Chapter 4 suggests some alternative choices for an effective dimension.
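To make the analytic penalty concrete, the following sketch evaluates it for the simplest linear model, a line fit with weight decay, where $X$ has rows $(1, x_n)$ and the $2 \times 2$ matrix $X^TX + \lambda I$ can be inverted by hand. The function name `permutation_penalty_line` is hypothetical, and the code is only meant as a small numerical check of the formula, not a general implementation.

```python
def permutation_penalty_line(xs, ys, lam=0.0):
    """Return (penalty, d_eff) for a line fit with weight decay parameter lam.

    penalty = (2 * sigma_hat^2 / N) * (trace(H) - 1^T H 1 / N),
    with H = X (X^T X + lam I)^{-1} X^T and X having rows (1, x_n).
    """
    n = len(xs)
    # Entries of the symmetric 2x2 matrix A = X^T X + lam * I.
    a11 = n + lam
    a12 = sum(xs)
    a22 = sum(x * x for x in xs) + lam
    det = a11 * a22 - a12 * a12

    def hat(xm, xn):
        # H_mn = (1, x_m) A^{-1} (1, x_n)^T, using the explicit 2x2 inverse.
        return (a22 - a12 * (xm + xn) + a11 * xm * xn) / det

    trace_h = sum(hat(x, x) for x in xs)
    ones_h_ones = sum(hat(xm, xn) for xm in xs for xn in xs)
    d_eff = trace_h - ones_h_ones / n

    y_bar = sum(ys) / n
    sigma2_hat = sum((y - y_bar) ** 2 for y in ys) / (n - 1)  # unbiased variance
    return 2.0 * sigma2_hat * d_eff / n, d_eff


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 2.0, 5.0, 4.0]
    print(permutation_penalty_line(xs, ys, lam=0.0))   # d_eff equals d = 1
    print(permutation_penalty_line(xs, ys, lam=10.0))  # weight decay shrinks d_eff
```

With no regularization the effective dimension comes out as $d = 1$, matching the special case in the example; increasing `lam` lowers it.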














9.5.3 Rademacher Generalization Bound 


The theoretical justification for the Rademacher optimism penalty is that, as 
with the VC bound, it can be used to bound Eout. A similar though math- 
ematically more complex justification can also be shown for the permutation 
estimate. The mathematical proof of this is interesting in that it presents new 
tools for analyzing generalization error. An energetic reader may wish to mas- 
ter these techniques by working through Problem 9.16. We present the result 
for a hypothesis set that is closed under negation (h € H implies —h € H). 







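Theorem 9.6 (Rademacher Bound). With probability at least $1 - \delta$,

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \max_{h \in \mathcal{H}} \left\{ \frac{1}{N} \sum_{n=1}^{N} r_n h(\mathbf{x}_n) \right\} + O\left(\sqrt{\frac{1}{N} \log \frac{1}{\delta}}\right).$$

The probability is with respect to the data set and a single realization of the Rademacher variables.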


The overfit penalty in the bound is twice the Rademacher optimism penalty 
(we can ignore the third term which does not depend on the hypothesis set 
and is small). The bound is similar to the VC bound in that it holds for any 
target function and input distribution. Unlike the VC bound, it depends on 
the data set and can be significantly tighter than the VC bound. 


9.5.4 Out-of-Sample Error Estimates for Regression 


We close this chapter with a discussion of validation methods for regression, 
and a model selection experiment in the context of regression, to illustrate 
some of the issues to think about when using validation for model selection. 
Cross-validation and the permutation optimism penalty are general in that 
they can be applied to regression and only require in-sample minimization. 
They also require no assumptions on the data distribution to work. We call 
these types of approaches sampling approaches because they work using sam- 
ples from the data. 
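As a reminder of how little machinery a sampling approach needs, here is a minimal leave-one-out cross validation sketch for regression with squared error. The two learning algorithms (`fit_constant` and `fit_line`) and the helper name `loo_cv_error` are illustrative assumptions; any routine that maps training data to a prediction function could be passed in instead.

```python
def fit_constant(xs, ys):
    """Degree-0 model: predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Degree-1 model: ordinary least-squares line (needs distinct inputs)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return lambda x: y_bar + slope * (x - x_bar)


def loo_cv_error(xs, ys, fit):
    """Average squared error on each point when it is left out of training."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        predict = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (predict(xs[i]) - ys[i]) ** 2
    return total / n


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0.1, 1.9, 4.2, 5.8, 8.1]
    print("constant:", loo_cv_error(xs, ys, fit_constant))
    print("line:    ", loo_cv_error(xs, ys, fit_line))
```

Only the learning routine is called, once per left-out point, and no assumption about the data distribution enters the computation.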

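The VC bound only applies to classification. There is an analog of the VC bound for regression. The statistics community has also developed a number of estimates. The estimates are multiplicative in form, so $E_{\text{out}} \approx (1 + \Omega(p)) E_{\text{in}}$, where $p = \frac{N}{d_{\text{eff}}}$ is the number of data points per effective degree of freedom (see Example 9.5 for one estimate of the effective degrees of freedom in a linear regression setting).

VC penalty factor        $E_{\text{out}} \le \dfrac{\sqrt{p}}{\sqrt{p} - \sqrt{1 + \ln p + \frac{\ln N}{2 d_{\text{eff}}}}}\, E_{\text{in}}$

Akaike’s FPE             $E_{\text{out}} \approx \dfrac{p + 1}{p - 1}\, E_{\text{in}}$

Schwartz criterion       $E_{\text{out}} \approx \left(1 + \dfrac{\ln N}{p - 1}\right) E_{\text{in}}$

Craven & Wahba’s GCV     $E_{\text{out}} \approx \dfrac{p^2}{(p - 1)^2}\, E_{\text{in}}$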


For large $p$, FPE (final prediction error) and GCV (generalized cross validation) are similar: $E_{\text{out}} = (1 + 2p^{-1} + O(p^{-2}))E_{\text{in}}$. This suggests that the simpler estimator $E_{\text{out}} = (1 + \frac{2 d_{\text{eff}}}{N}) E_{\text{in}}$ might be good enough, and, from the practical point of view, it is good enough.

The statistical estimates (FPE, Schwartz and GCV) make model assumptions and are good asymptotic estimates modulo those assumptions. The VC penalty is more generally valid. However, it tends to be very conservative. In practice, the VC penalty works quite well, as we will see in the experiments.
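The large-$p$ agreement between the statistical estimates is easy to check numerically. The sketch below assumes the usual forms of these factors, namely $(p+1)/(p-1)$ for FPE, $1 + \ln N/(p-1)$ for the Schwartz criterion, $p^2/(p-1)^2$ for GCV and $1 + 2/p$ for the simplified estimator; the function names are hypothetical.

```python
import math


def fpe_factor(p):
    """Akaike's final prediction error factor (assumed form), valid for p > 1."""
    return (p + 1.0) / (p - 1.0)


def schwartz_factor(p, n):
    """Schwartz criterion factor (assumed form); n is the number of data points."""
    return 1.0 + math.log(n) / (p - 1.0)


def gcv_factor(p):
    """Generalized cross validation factor (assumed form), valid for p > 1."""
    return p * p / (p - 1.0) ** 2


def simple_factor(p):
    """Simplified estimator 1 + 2 / p, i.e. 1 + 2 * d_eff / N."""
    return 1.0 + 2.0 / p


if __name__ == "__main__":
    n = 1000
    for d_eff in (2, 10, 100):
        p = n / d_eff  # data points per effective degree of freedom
        print(d_eff, fpe_factor(p), gcv_factor(p), simple_factor(p),
              schwartz_factor(p, n))
```

Multiplying the in-sample error by one of these factors gives the corresponding estimate of the out-of-sample error; for small $p$ the factors diverge noticeably, which is where the choice of estimate starts to matter.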


Validation Experiments with Regression. The next exercise sets up an experimental design for studying validation and model selection. We consider two model selection problems. The first is to determine the right order of the polynomial to fit the data. In this case, we have models $\mathcal{H}_0, \mathcal{H}_1, \mathcal{H}_2, \ldots$, corresponding to the polynomials of degree $0, 1, 2, \ldots$. The model selection task is to select one of these models. The second is to determine the right value of the regularization parameter $\lambda$ in the weight decay regularizer. In this case, we fix a model (say) $\mathcal{H}_Q$ and minimize the augmented error $E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N}\mathbf{w}^T\mathbf{w}$; we have models $\mathcal{H}_Q(\lambda_1), \mathcal{H}_Q(\lambda_2), \mathcal{H}_Q(\lambda_3), \ldots$, corresponding to the choices $\lambda_1, \lambda_2, \lambda_3, \ldots$ for the regularization parameter. The model selection task is to select one of these models, which corresponds to selecting the appropriate amount of regularization.


Exercise 9.18 [Experimental Design for Model Selection]
Use the learning set up in Exercise 4.2.


(a) Order Selection. For a single experiment, generate $\sigma^2 \in [0, 1]$ and the target function degree in $\{0, \ldots, 30\}$. Set $N = 100$ and generate a data set $\mathcal{D}$. Consider the models up to order 20: $\mathcal{H}_0, \ldots, \mathcal{H}_{20}$. For these models, obtain the final hypotheses $g_0, \ldots, g_{20}$ and their test errors $E_{\text{out}}^{(0)}, \ldots, E_{\text{out}}^{(20)}$. Now obtain estimates of the out-of-sample errors, $\hat{E}_{\text{out}}^{(0)}, \ldots, \hat{E}_{\text{out}}^{(20)}$, using the following validation estimates:

VC penalty; LOO-CV; Permutation estimate; FPE.

Plot the out-of-sample error and the validation estimates versus the model order, averaged over many experiments.


(b) $\lambda$ Selection. Repeat the previous exercise to select the regularization parameter. Fix the order of the model to $Q = 5$. Randomly generate $\sigma^2 \in [0, 1]$ and the order of the target from $\{0, \ldots, 10\}$, and set $N = 15$. Use different models with $\lambda \in [0, 300]$.


(c) For model selection, you do not need a perfect error estimate. What 
is a weaker, but sufficient, requirement for a validation estimate to 
give good model selection? 


(d) To quantify the quality of an error estimate, we may use it to select 
a model and compute the actual out-of-sample error of the selected 
model, versus the model with minimum out-of-sample error (after 
training with the same data). The regret is the relative increase 
in error incurred by using the selected model versus the optimum 
model. Compare the regret of different validation estimates, taking 
the average over many runs. 
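Part (d) of the exercise can be made concrete with a small helper. In the sketch below, `e_out` holds the actual out-of-sample errors of the candidate models and `estimates` holds one validation method's estimates for the same models (both are hypothetical inputs you would obtain from your own experiment); the function returns the relative increase in error, which can be multiplied by 100 to express it as a percentage.

```python
def regret(e_out, estimates):
    """Relative increase in E_out from selecting the model with the best estimate."""
    # Index of the model the validation estimate would select.
    selected = min(range(len(estimates)), key=lambda i: estimates[i])
    # Error of the model we would have picked with perfect knowledge of E_out.
    best = min(e_out)
    return (e_out[selected] - best) / best


if __name__ == "__main__":
    e_out = [1.0, 0.5, 0.8]      # true out-of-sample errors of three models
    estimates = [0.9, 0.7, 0.6]  # one method's estimates for the same models
    print(regret(e_out, estimates))
```

Averaging this quantity over many independently generated data sets gives the figure of merit used to compare validation estimates.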







[Plots: (a) $E_{\text{out}}$ estimate vs. model order $Q$; (b) $E_{\text{out}}$ estimate vs. regularization parameter $\lambda$. Curves: $E_{\text{out}}$, $E_{\text{cv}}$, $E_{\text{perm}}$, $E_{\text{VC}}$.]


Figure 9.8: Validation estimates of $E_{\text{out}}$. (a) polynomial models of different order $Q$; (b) models with different regularization parameter $\lambda$. The estimates and $E_{\text{out}}$ are averaged over many runs; the 3 validation methods shown are $E_{\text{cv}}$ (leave-one-out cross validation estimate), $E_{\text{perm}}$ (permutation estimate), and $E_{\text{VC}}$ (VC penalty bound). The VC penalty bound is far above the plot because it is so loose (indicated by the arrow); we have shifted it down to show its shape. The VC estimate is not useful for estimating $E_{\text{out}}$ but can be useful for model selection.


The average performance of the validation estimates of $E_{\text{out}}$ from our implementation of Exercise 9.18 is shown in Figure 9.8. As can be seen from the figure, when it comes to estimating the out-of-sample error, the cross validation estimate is the hands-down favorite (on average). The cross validation estimate has a systematic bias because it is actually estimating the out-of-sample error when learning with $N - 1$ examples, and the actual out-of-sample error with $N$ examples will be lower. The out-of-sample error is asymmetric about its minimum: the price paid for overfitting increases more steeply than the price paid for underfitting. As can be seen by comparing where the minimum of these curves lies, the VC estimate is the most conservative. The VC overfit penalty strongly penalizes complex models, so that the optimum model (according to the VC estimate) is much simpler.


We now compare the different validation estimates for model selection (Exercise 9.18(d)). We look at the regret, the average percentage increase in $E_{\text{out}}$ from using an $E_{\text{out}}$-estimate to pick a model versus picking the optimal model. These results are summarized in the table of Figure 9.9. The surprising thing is that the permutation estimate and the VC estimate seem to dominate, even though cross validation gives the best estimate of $E_{\text{out}}$ on average. This has to do with the asymmetry of $E_{\text{out}}$ around the best model. It pays to be conservative, and err on the side of the simpler model (underfitting) rather than on the side of the more complex model (overfitting), because the price paid for overfitting is far steeper than the price paid for underfitting. The permutation estimate tends to underfit, and the VC estimate even more so; these estimates are paying the ‘small’ price for underfitting most of the time. On the other hand, cross validation picks the correct model most of the time, but occasionally goes for too complex a model and pays a very steep price. In the end, the more conservative strategy wins. Using $E_{\text{in}}$ is always the worst, no surprise.

                           Order Selection          $\lambda$ Selection
$E_{\text{out}}$ Estimate  Regret    Avg. Order     Regret    Avg. $\lambda$

$E_{\text{out}}$           0         10.0           0         7.93
$E_{\text{in}}$            43268     20.0           117       0
$E_{\text{cv}}$            540       9.29           18.8      23.1
$E_{\text{perm}}$          185       7.21           5.96      9.57
$E_{\text{FPE}}$           9560      11.42          51.3      18.1
$E_{\text{VC}}$            508       5.56           3.50      125

Figure 9.9: Using estimates of $E_{\text{out}}$ for model selection (Exercise 9.18(d)). The best validation estimate in each case is the one with the lowest regret ($E_{\text{perm}}$ for order selection, $E_{\text{VC}}$ for $\lambda$ selection). For order selection, the order $Q \in \{0, \ldots, 20\}$, and for regularization $\lambda \in [0, 300]$. All results are averaged over many thousands of experiments. (Regret $= \frac{E_{\text{out}}(g) - E_{\text{out}}(\text{optimal})}{E_{\text{out}}(\text{optimal})}$.)

You always need regularization, so $\lambda = 0$ is bad. Suppose we remove that case from our set of available models. Now repeat the experiment for selecting the $\lambda$ in the range $[0.1, 300]$, as opposed to the range $[0, 300]$, a very minor difference in the available choices for $\lambda$. The results become

$\lambda \in [0.1, 300]$   $E_{\text{out}}$   $E_{\text{in}}$   $E_{\text{cv}}$   $E_{\text{perm}}$   $E_{\text{FPE}}$   $E_{\text{VC}}$
Regret                     0                  1.81              0.44              0.39                0.87               0.42


The permutation estimate is best. Cross validation is now much better than before. The in-sample error is still by far the worst. Why did all methods get much better? Because we curtailed their ability to err in favor of too complex a model. What we were experiencing is ‘overfitting’ during the process of model selection, and by removing the choice $\lambda = 0$, we ‘regularized’ the model selection process. The cross validation estimate is the most accurate, but also the most sensitive, and has a high potential to be led astray, thus it benefited most from the regularization of the model selection. The FPE estimate is a statistical estimate. Such statistical estimates typically make model assumptions and are most useful in the large $N$ limit, so their applicability in practice may be limited. The statistical estimate fares okay, but, for this experiment, it is not really competitive with the permutation approach, cross validation, or the VC penalty.


Our recommendation. Do not arbitrarily pick models to select among. One can ‘overfit’ during model selection. Carefully decide on a set of models and use a robust validation method to select one. There is no harm in using multiple methods for model selection, and we recommend both the permutation estimate (robust, easy to compute and generally applicable) and cross validation (general, easy to use and usually the most accurate for estimating $E_{\text{out}}$).


9.6 Problems


Problem 9.1 Consider data $(0, 0, +1)$, $(0, 1, +1)$, $(5, 5, -1)$ (in 2-D), where the first two entries are $x_1, x_2$ and the third is $y$.


(a) Implement the nearest neighbor method on the raw data and show the 
decision regions of the final hypothesis. 


(b) Transform to whitened coordinates and run the nearest neighbor rule, 
showing the final hypothesis in the original space. 


(c) Use principal components analysis and reduce the data to 1 dimension 
for your nearest neighbor classifier. Again, show the decision regions of 
the final hypothesis in the original space. 


Problem 9.2 We considered three forms of input preprocessing: centering, normalization and whitening. The goal of input processing is to make the learning robust to unintentional choices made during data collection.


Suppose the data are $\mathbf{x}_n$ and the transformed vectors $\mathbf{z}_n$. Suppose that during data collection, $\mathbf{x}'_n$, a mutated version of $\mathbf{x}_n$, were measured, and input preprocessing on $\mathbf{x}'_n$ produces $\mathbf{z}'_n$. We would like to study when the $\mathbf{z}'_n$ would be the same as the $\mathbf{z}_n$. In other words, what kinds of mutations can the data vectors be subjected to without changing the result of your input processing? The learning will be robust to such mutations.


Which input processing methods are robust to the following mutations:


Bias: $\mathbf{x}'_n = \mathbf{x}_n + \mathbf{b}$ where $\mathbf{b}$ is a constant vector.
Uniform scaling: $\mathbf{x}'_n = \alpha \mathbf{x}_n$, where $\alpha > 0$ is a constant.
Scaling: $\mathbf{x}'_n = A\mathbf{x}_n$ where $A$ is a diagonal non-singular matrix.
Linear transformation: $\mathbf{x}'_n = A\mathbf{x}_n$ where $A$ is a non-singular matrix.


Problem 9.3 Let $\Sigma$ be a symmetric positive definite matrix with eigen-decomposition $\Sigma = U\Gamma U^T$ (see the Appendix), where $U$ is orthogonal and $\Gamma$ is positive diagonal. Show that

$$\Sigma^{\frac{1}{2}} = U\Gamma^{\frac{1}{2}}U^T \quad \text{and} \quad \Sigma^{-\frac{1}{2}} = U\Gamma^{-\frac{1}{2}}U^T.$$

What are $\Gamma^{\frac{1}{2}}$ and $\Gamma^{-\frac{1}{2}}$?


Problem 9.4 If $A$ and $V$ are orthonormal bases, and $A = V\psi$, show that $\psi = V^TA$ and hence that $\psi$ is an orthogonal matrix.


Problem 9.5 Give the SVD of matrix $A$ defined in each case below.


(a) $A$ is a diagonal matrix.
(b) $A$ is a matrix with pairwise orthogonal rows.
(c) $A$ is a matrix with pairwise orthogonal columns.
(d) Let $A$ have SVD $U\Gamma V^T$ and $Q^TQ = I$. What is the SVD of $QA$?
(e) $A$ has blocks $A_i$ along the diagonal, where $A_i$ has SVD $U_i\Gamma_iV_i^T$,

$$A = \begin{bmatrix} A_1 & & & \\ & A_2 & & \\ & & \ddots & \\ & & & A_m \end{bmatrix}.$$


Problem 9.6 For the digits data, suppose that the data were not centered 
first in Example 9.2. Perform PCA and obtain a 2-dimensional feature vector 
(give a plot). Are the transformed data whitened? 


Problem 9.7 Assume that the input matrix $X$ is centered, and construct the reduced feature vector $\mathbf{z}_n = \Gamma_k^{-1}V_k^T\mathbf{x}_n$ where $V_k$ is the matrix of top-$k$ right singular vectors of $X$. Show that the feature vectors $\mathbf{z}_n$ are whitened.


Problem 9.8 One critique of PCA is that it selects features without 
regard to target values. Thus, the features may not be so useful in solving the 
learning problem, even though they represent the input data well. 


One heuristic to address this is to do a PCA separately for each class, and use the top principal direction for each class to construct the features. Use the digits data. Construct a two dimensional feature as follows.


1: Center all the data.
2: Perform PCA on the +1 data. Let $\mathbf{v}_1$ be the top principal direction.
3: Perform PCA on the -1 data. Let $\mathbf{v}_2$ be the top principal direction.
4: Use $\mathbf{v}_1$ and $\mathbf{v}_2$ to obtain the features $z_1 = \mathbf{v}_1^T\mathbf{x}$ and $z_2 = \mathbf{v}_2^T\mathbf{x}$.


We applied the algorithm to the digits data, giving the features shown above.

(a) Give a scatter plot of your resulting two features.
(b) Are the directions $\mathbf{v}_1$ and $\mathbf{v}_2$ necessarily orthogonal?
(c) Let $Z = XV$ be the feature matrix, where $V = [\mathbf{v}_1, \mathbf{v}_2]$. Show that the best reconstruction of $X$ from $Z$ is
$$\hat{X} = ZV^\dagger = XVV^\dagger,$$
where $V^\dagger$ is the pseudo-inverse of $V$.


(d) Is this method of constructing features supervised or unsupervised? 


Problem 9.9 Let $X = U\Gamma V^T$ be the SVD of $X$. Give an alternative proof of the Eckart-Young theorem by showing the following steps.


(a) Let $\hat{X}$ be a reconstruction of $X$ whose rows are the data points in $X$ projected onto $k$ basis vectors. What is the $\text{rank}(\hat{X})$?

(b) Show that $\|X - \hat{X}\|_F^2 \ge \|U^T(X - \hat{X})V\|_F^2 = \|\Gamma - U^T\hat{X}V\|_F^2$.

(c) Let $\hat{\Gamma} = U^T\hat{X}V$. Show that $\text{rank}(\hat{\Gamma}) \le k$.

(d) Argue that the optimal choice for $\hat{\Gamma}$ must have all off-diagonal elements zero. [Hint: What is the definition of the Frobenius norm?]

(e) How many non-zero diagonals can $\hat{\Gamma}$ have?

(f) What is the best possible choice for $\hat{\Gamma}$?

(g) Show that $\hat{X} = XVV^T$ results in such an optimal choice for $\hat{\Gamma}$.


Problem 9.10 Data Snooping with Input Preprocessing. You are going to run PCA on your data $X$ and select the top-$k$ PCA features. Then, use linear regression with these $k$ features to produce your final hypothesis. You want to estimate $E_{\text{out}}$ using cross validation. Here are two possible algorithms for getting a cross-validation error.


Algorithm 1
1: SVD: $[U, \Gamma, V] = \text{svd}(X)$.
2: Get features $Z = XV_k$.
3: for $n = 1 : N$ do
4:   Leave out $(\mathbf{z}_n, y_n)$ to obtain data $Z_n, \mathbf{y}_n$.
5:   Learn $\mathbf{w}_n$ from $Z_n, \mathbf{y}_n$.
6:   Error $e_n = (\mathbf{z}_n^T\mathbf{w}_n - y_n)^2$.
7: $E_1 = \text{average}(e_1, \ldots, e_N)$.

Algorithm 2
1: for $n = 1 : N$ do
2:   Leave out $(\mathbf{x}_n, y_n)$ to obtain data $X_n, \mathbf{y}_n$.
3:   SVD: $[U^-, \Gamma^-, V^-] = \text{svd}(X_n)$.
4:   Get features $Z_n = X_nV_k^-$.
5:   Learn $\mathbf{w}_n$ from $Z_n, \mathbf{y}_n$.
6:   Error $e_n = (\mathbf{x}_n^TV_k^-\mathbf{w}_n - y_n)^2$.
7: $E_2 = \text{average}(e_1, \ldots, e_N)$.


In both cases, your final hypothesis is $g(\mathbf{x}) = \mathbf{x}^TV_k\mathbf{w}$ where $\mathbf{w}$ is learned from $Z = XV_k$ from Algorithm 1 and $\mathbf{y}$. (Recall $V_k$ is the matrix of top-$k$ singular vectors.) We want to estimate $E_{\text{out}}(g)$ using either $E_1$ or $E_2$.


(a) What is the difference between Algorithm 1 and Algorithm 2?

(b) Run an experiment to determine which is a better estimate of $E_{\text{out}}(g)$.

(i) Set the dimension $d = 5$, the number of data points $N = 40$ and $k = 3$. Generate random target function weights $\mathbf{w}_f$, normally distributed. Generate $N$ random normally distributed $d$-dimensional inputs $\mathbf{x}_1, \ldots, \mathbf{x}_N$. Generate targets $y_n = \mathbf{w}_f^T\mathbf{x}_n + \epsilon_n$ where $\epsilon_n$ is independent Gaussian noise with variance 0.5.
(ii) Use Algorithm 1 and 2 to compute estimates of $E_{\text{out}}(g)$.
(iii) Compute $E_{\text{out}}(g)$.
(iv) Report $E_1$, $E_2$, $E_{\text{out}}(g)$.
(v) Repeat parts (i)-(iv) 10° times and report averages $\bar{E}_1$, $\bar{E}_2$, $\bar{E}_{\text{out}}$.

(c) Explain your results from (b) part (v).

(d) Which is the correct estimate for $E_{\text{out}}(g)$?


Problem 9.11 From Figure 9.8(a), performance degrades rapidly as the order of the model increases. What if the target function has a high order? One way to allow for higher order, but still get good generalization is to fix the effective dimension $d_{\text{eff}}$, and try a variety of orders. Given $X$, $d_{\text{eff}}$ depends on $\lambda$. Fixing $d_{\text{eff}}$ (to say 7) requires changing $\lambda$ as the order increases.


(a) For fixed $d_{\text{eff}}$, when the order increases, will $\lambda$ increase or decrease?


(b) Implement a numerical method for finding $\lambda(d_{\text{eff}})$ using one of the measures for the effective number of parameters (e.g., $d_{\text{eff}} = \text{trace}(H^2(\lambda))$, where $H(\lambda) = X(X^TX + \lambda I)^{-1}X^T$ is the hat-matrix from Chapter 4). The inputs are $d_{\text{eff}}$ and $X$, and the output should be $\lambda(d_{\text{eff}}, X)$.


(c) Use the experimental design in Exercise 9.18 to evaluate this approach. Vary the model order in $[0, 30]$ and set $d_{\text{eff}}$ to 7. Plot the expected out-of-sample error versus the order.


(d) Comment on your results. How should your plot behave if $d_{\text{eff}}$ alone controlled the out-of-sample error? What is the best model order using this approach?


Problem 9.12 In this problem you will derive the permutation optimism penalty in Equation (9.5). We compute the optimism for a particular data distribution and use that to penalize $E_{\text{in}}$ as in (9.5). The data is $\mathcal{D} = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, and the permuted data set is

$$\mathcal{D}_\pi = (\mathbf{x}_1, y_{\pi_1}), \ldots, (\mathbf{x}_N, y_{\pi_N}).$$

Define the ‘permutation input distribution’ to be uniform over $\mathbf{x}_1, \ldots, \mathbf{x}_N$. Define a ‘permutation target function’ $f_\pi$ as follows: to compute target values $f_\pi(\mathbf{x}_n)$, generate a random permutation $\pi$ and set $f_\pi(\mathbf{x}_n) = y_{\pi_n}$. So,

$$\mathbb{P}[f_\pi(\mathbf{x}_n) = y_m] = \frac{1}{N},$$

for $m = 1, \ldots, N$, independent of the particular $\mathbf{x}_n$. We define $E^\pi_{\text{out}}(h)$ with respect to this target function on the inputs $\mathbf{x}_1, \ldots, \mathbf{x}_N$.


(a) Define $E^\pi_{\text{out}}(h) = \frac{1}{4}\mathbb{E}_{\mathbf{x},\pi}[(h(\mathbf{x}) - f_\pi(\mathbf{x}))^2]$, where expectation is with respect to the permutation input and target joint distribution. Show that

$$E^\pi_{\text{out}}(h) = \frac{1}{4N}\sum_{n=1}^{N}\mathbb{E}_\pi[(h(\mathbf{x}_n) - f_\pi(\mathbf{x}_n))^2].$$


(b) Show that

$$E^\pi_{\text{out}}(h) = \frac{s_y^2}{4} + \frac{1}{4N}\sum_{n=1}^{N}(h(\mathbf{x}_n) - \bar{y})^2,$$

where $\bar{y} = \frac{1}{N}\sum_n y_n$ and $s_y^2 = \frac{1}{N}\sum_n (y_n - \bar{y})^2$ are the mean and variance of the target values.


(c) In-sample error minimization on $\mathcal{D}_\pi$ yields $g_\pi$. What is $E^\pi_{\text{in}}(g_\pi)$?


(d) The permutation optimism penalty is $E^\pi_{\text{out}}(g_\pi) - E^\pi_{\text{in}}(g_\pi)$. Show:

$$\text{permutation optimism penalty} = \frac{1}{2N}\sum_{n=1}^{N}(y_{\pi_n} - \bar{y})\, g_\pi(\mathbf{x}_n). \qquad (9.7)$$


(e) Show that the permutation optimism penalty is proportional to the correlation between the randomly permuted targets $y_{\pi_n}$ and the learned function $g_\pi(\mathbf{x}_n)$.


Problem 9.13 Repeat Problem 9.12, but, instead of defining the target distribution for the random problem using a random permutation (sampling the targets without replacement), use sampling of the targets with replacement (the Bootstrap distribution).

[Hint: Almost everything stays the same.] 


Problem 9.14 In this problem you will investigate the Rademacher optimism penalty for the perceptron in one dimension, $h(x) = \text{sign}(x - w_0)$.


(a) Write a program to estimate the Rademacher optimism penalty:
(i) Generate inputs $x_1, \ldots, x_N$ and random targets $r_1, \ldots, r_N$.
(ii) Find the perceptron $g_r$ with minimum in-sample error $E'_{\text{in}}(g_r)$.
(iii) Compute the optimism penalty as $\frac{1}{2} - E'_{\text{in}}(g_r)$.
Run your program for $N = 1, 2, \ldots,$ 10° and plot the penalty versus $N$.


(b) Repeat part (a) 10,000 times to compute the average Rademacher penalty and give a plot of the penalty versus $N$.

(c) On the same plot show the function $1/\sqrt{N}$; how does the Rademacher penalty compare to $1/\sqrt{N}$? What is the VC-penalty for this learning model and how does it compare with the Rademacher penalty?


Problem 9.15 The Rademacher optimism penalty is

$$E'_{\text{out}}(g_r) - E'_{\text{in}}(g_r) = \frac{1}{2} - E'_{\text{in}}(g_r).$$

Let $\mathcal{H}$ have growth function $m_{\mathcal{H}}(N)$. Show that, with probability at least $1 - \delta$,

$$\text{Rademacher optimism penalty} \le \sqrt{\frac{1}{2N}\ln\frac{2m_{\mathcal{H}}(N)}{\delta}}.$$

[Hint: For a single hypothesis, Hoeffding gives $\mathbb{P}[|E_{\text{out}}(h) - E_{\text{in}}(h)| > \epsilon] \le 2e^{-2N\epsilon^2}$.]


Problem 9.16 [Hard] Rademacher Generalization Bound. The Rademacher optimism penalty bounds the test error for binary classification (as does the VC-bound). This problem guides you through the proof.


We will need McDiarmid’s inequality (a generalization of the Hoeffding bound):


Lemma 9.1 (McDiarmid, 1989). Let $X_i \in A_i$ be independent random variables, and $Q$ a function, $Q : \prod_i A_i \to \mathbb{R}$, satisfying

$$\sup_{\mathbf{x} \in \prod_i A_i,\; z \in A_j} |Q(\mathbf{x}) - Q(x_1, \ldots, x_{j-1}, z, x_{j+1}, \ldots, x_n)| \le c_j,$$

for $j = 1, \ldots, n$. Then, for $t > 0$,

$$\mathbb{P}[Q(X_1, \ldots, X_n) - \mathbb{E}[Q(X_1, \ldots, X_n)] \ge t] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} c_i^2}\right).$$


Corollary 9.2. With probability at least $1 - \delta$,

$$|Q(X_1, \ldots, X_n) - \mathbb{E}[Q(X_1, \ldots, X_n)]| \le \sqrt{\frac{1}{2}\sum_{i=1}^{n} c_i^2 \ln\frac{2}{\delta}}.$$


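(a) Assume that the hypothesis set is symmetric ($h \in \mathcal{H}$ implies $-h \in \mathcal{H}$). Prove that with probability at least $1 - \delta$,

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \mathbb{E}_{\mathbf{r},\mathcal{D}}\left[\max_{h \in \mathcal{H}} \frac{1}{N}\sum_{n=1}^{N} r_n h(\mathbf{x}_n)\right] + \sqrt{\frac{1}{2N}\log\frac{2}{\delta}}. \qquad (9.8)$$

To do this, show the following steps.

(i) $E_{\text{out}}(g) \le E_{\text{in}}(g) + \max_{h \in \mathcal{H}}\{E_{\text{out}}(h) - E_{\text{in}}(h)\}$.

(ii) $\max_{h \in \mathcal{H}}\{E_{\text{out}}(h) - E_{\text{in}}(h)\} = \frac{1}{2}\max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} y_n h(\mathbf{x}_n) - \mathbb{E}_{\mathbf{x}}[f(\mathbf{x})h(\mathbf{x})]\right\}$.

(iii) Show that $\frac{1}{2}\max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} y_n h(\mathbf{x}_n) - \mathbb{E}_{\mathbf{x}}[f(\mathbf{x})h(\mathbf{x})]\right\}$ is upper bounded with probability at least $1 - \delta$ by

$$\frac{1}{2}\mathbb{E}_{\mathcal{D}}\left[\max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} y_n h(\mathbf{x}_n) - \mathbb{E}_{\mathbf{x}}[f(\mathbf{x})h(\mathbf{x})]\right\}\right] + \sqrt{\frac{1}{2N}\log\frac{2}{\delta}}.$$

[Hint: Use McDiarmid's inequality with

$$Q(\mathbf{x}_1, \ldots, \mathbf{x}_N, y_1, \ldots, y_N) = \max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} y_n h(\mathbf{x}_n) - \mathbb{E}_{\mathbf{x}}[f(\mathbf{x})h(\mathbf{x})]\right\}.$$

Show that perturbing data point $(\mathbf{x}_n, y_n)$ changes $Q$ by at most $\frac{2}{N}$.]

(iv) Let $\mathcal{D}' = (\mathbf{x}'_1, y'_1), \ldots, (\mathbf{x}'_N, y'_N)$ be an independent (ghost) data set. Show that

$$\mathbb{E}_{\mathbf{x}}[f(\mathbf{x})h(\mathbf{x})] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\mathcal{D}'}[y'_n h(\mathbf{x}'_n)].$$

Hence, show that $\max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} y_n h(\mathbf{x}_n) - \mathbb{E}_{\mathbf{x}}[f(\mathbf{x})h(\mathbf{x})]\right\}$ equals

$$\max_{h \in \mathcal{H}}\left\{\mathbb{E}_{\mathcal{D}'}\left[\frac{1}{N}\sum_{n=1}^{N}(y_n h(\mathbf{x}_n) - y'_n h(\mathbf{x}'_n))\right]\right\}.$$

By convexity, $\max_h\{\mathbb{E}[\cdot]\} \le \mathbb{E}[\max_h\{\cdot\}]$, hence show that

$$\max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} y_n h(\mathbf{x}_n) - \mathbb{E}_{\mathbf{x}}[f(\mathbf{x})h(\mathbf{x})]\right\} \le \mathbb{E}_{\mathcal{D}'}\left[\max_{h \in \mathcal{H}}\frac{1}{N}\sum_{n=1}^{N}(y_n h(\mathbf{x}_n) - y'_n h(\mathbf{x}'_n))\right].$$

Conclude that

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \frac{1}{2}\mathbb{E}_{\mathcal{D},\mathcal{D}'}\left[\max_{h \in \mathcal{H}}\frac{1}{N}\sum_{n=1}^{N}(y_n h(\mathbf{x}_n) - y'_n h(\mathbf{x}'_n))\right] + \sqrt{\frac{1}{2N}\log\frac{2}{\delta}}.$$

The remainder of the proof is to bound the second term on the RHS.

(v) Let $r_1, \ldots, r_N$ be arbitrary $\pm 1$ Rademacher variables. Show that

$$\mathbb{E}_{\mathcal{D},\mathcal{D}'}\left[\max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N}(y_n h(\mathbf{x}_n) - y'_n h(\mathbf{x}'_n))\right\}\right] = \mathbb{E}_{\mathcal{D},\mathcal{D}'}\left[\max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} r_n(y_n h(\mathbf{x}_n) - y'_n h(\mathbf{x}'_n))\right\}\right]$$

$$\le 2\,\mathbb{E}_{\mathcal{D}}\left[\max_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} r_n y_n h(\mathbf{x}_n)\right\}\right].$$

[Hint: Argue that $r_n = -1$ effectively switches $\mathbf{x}_n$ with $\mathbf{x}'_n$, which is just a relabeling of variables in the expectation over $\mathbf{x}_n, \mathbf{x}'_n$. For the second step, use $\max\{A - B\} \le \max|A| + \max|B|$ and the fact that $\mathcal{H}$ is symmetric.]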

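(vi) Since the bound in (v) holds for any $r$, we can take the expectation
over independent $r_n$ with $\mathbb{P}[r_n = +1] = \frac{1}{2}$. Hence, show that

$$\mathbb{E}_{\mathcal{D},\mathcal{D}'}\left[\max_{h\in\mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N}\big(y_n h(x_n) - y'_n h(x'_n)\big)\right\}\right] \le 2\,\mathbb{E}_{r,\mathcal{D}}\left[\max_{h\in\mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} r_n h(x_n)\right\}\right],$$

and obtain (9.8). [Hint: what is the distribution of $r_n y_n$?]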
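
(b) In part (a) we obtained a generalization bound in terms of twice the
expected Rademacher optimism penalty. To prove Theorem 9.6, show
that this expectation can be well approximated by a single realization.

(i) Let $Q(r_1,\ldots,r_N,x_1,\ldots,x_N) = \max_{h\in\mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} r_n h(x_n)\right\}$. Show
that if you change an input of $Q$, its value changes by at most $\frac{2}{N}$.

(ii) Show that with probability at least $1 - \delta$,

$$\mathbb{E}_{r,\mathcal{D}}\left[\max_{h\in\mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} r_n h(x_n)\right\}\right] \le \max_{h\in\mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} r_n h(x_n)\right\} + \sqrt{\frac{2}{N}\ln\frac{2}{\delta}}.$$

(iii) Apply the union bound to show that with probability at least $1 - \delta$,

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \max_{h\in\mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^{N} r_n h(x_n)\right\} + \sqrt{\frac{9}{2N}\ln\frac{4}{\delta}}.$$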
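
To make the single-realization penalty in part (b) concrete, here is a minimal sketch that evaluates $\max_{h}\frac{1}{N}\sum_n r_n h(x_n)$ exactly for one draw of $r$. The hypothesis set is an assumption chosen for illustration, not taken from the text: one-dimensional rays $h(x) = s\,\mathrm{sign}(x - \theta)$ with $s = \pm 1$, which is a symmetric set. The function name `rademacher_penalty_rays` is hypothetical.

```python
import numpy as np

def rademacher_penalty_rays(x, r):
    """Return max over h of (1/N) * sum_n r_n h(x_n) for 1-D rays.

    Assumed hypothesis set (not from the text): h(x) = s * sign(x - theta),
    s in {+1, -1}, which is symmetric. The inputs x are assumed distinct.
    """
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    N = len(x)
    r_sorted = r[np.argsort(x)]
    # prefix[k] = sum of r over the k smallest inputs (labeled -1 when s = +1)
    prefix = np.concatenate(([0.0], np.cumsum(r_sorted)))
    # correlation for s = +1 at each of the N + 1 distinct threshold positions
    corr = (prefix[-1] - 2.0 * prefix) / N
    # s = -1 negates the correlation, so maximizing over both signs
    # is the same as taking the largest absolute value
    return float(np.max(np.abs(corr)))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
r = rng.choice([-1.0, 1.0], size=200)   # one realization of the Rademacher variables
penalty = rademacher_penalty_rays(x, r)
```

For richer hypothesis sets the maximization is itself a learning problem (fit $h$ to the random labels $r$), so an exact scan like this is only available in simple cases.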


Problem 9.17 [Hard] Permutation Generalization Bound. Prove
that with probability at least $1 - \delta$,

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \mathbb{E}_{\pi}\left[\max_{h\in\mathcal{H}}\frac{1}{N}\sum_{n=1}^{N} y^{(\pi)}_n h(x_n)\right] + O\left(\sqrt{\frac{1}{N}\log\frac{1}{\delta}}\right).$$

The second term is similar to the permutation optimism penalty, differing by
$\bar{y}\,\mathbb{E}_{\pi}[\bar{g}_{\pi}]$, which is zero for balanced data.


[Hint: Up to introducing the $r_n$, you can follow the proof in Problem 9.16; now
pick a distribution for $r$ to mimic permutations. For some helpful tips, see “A
Permutation Approach to Validation,” Magdon-Ismail, Mertsalov, 2010.]


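Problem 9.18 Permutation Penalty for Linear Models. For
linear models, the predictions on the data are $\hat{y}^{(\pi)} = H y^{(\pi)}$, where $H$ is the hat
matrix, $H(\lambda) = X(X^T X + \lambda I)^{-1}X^T$, which is independent of $\pi$. For regression,
the permutation optimism penalty from (9.7) is $\frac{2}{N}\sum_{n=1}^{N}(y_{\pi_n} - \bar{y})\,g^{(\pi)}(x_n)$
(we do not divide the squared error by 4 for regression).


(a) Show that for a single permutation, the permutation penalty is

$$\frac{2}{N}\sum_{m,n=1}^{N} H_{mn}\left(y_{\pi_m}y_{\pi_n} - \bar{y}\,y_{\pi_n}\right).$$

(b) Show that $\mathbb{E}_{\pi}[y_{\pi_n}] = \bar{y}$, and

$$\mathbb{E}_{\pi}[y_{\pi_m}y_{\pi_n}] = \begin{cases}\bar{y}^2 + s_y^2, & m = n,\\[2pt] \bar{y}^2 - \frac{1}{N-1}s_y^2, & m \ne n.\end{cases}$$

($\bar{y}$ and $s_y^2$ are defined in Problem 9.12(b).)

(c) Take the expectation of the penalty in (a) and prove Equation (9.6):

$$\text{permutation optimism penalty} = \frac{2\hat{\sigma}_y^2}{N}\left(\text{trace}(S) - \frac{\mathbf{1}^T S \mathbf{1}}{N}\right),$$

where $\hat{\sigma}_y^2 = \frac{N}{N-1}s_y^2$ is the unbiased estimate of the target variance.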
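
As a sanity check on part (c), the expectation over permutations can be computed exactly for a tiny data set by enumerating all $N!$ permutations and comparing against the closed form. This is only a sketch under stated assumptions: the matrix $S$ in (9.6) is taken to be the hat matrix $H$, $s_y^2$ is taken to be $\frac{1}{N}\sum_n y_n^2 - \bar{y}^2$, and the function names and random test data are hypothetical.

```python
import itertools
import numpy as np

def expected_permutation_penalty(H, y):
    """Average the single-permutation penalty of part (a) over all N! permutations."""
    N = len(y)
    ybar = y.mean()
    vals = []
    for perm in itertools.permutations(range(N)):
        yp = y[list(perm)]
        # (2/N) * sum_{m,n} H_mn (y_pm * y_pn - ybar * y_pn)
        vals.append(2.0 / N * (yp @ H @ yp - ybar * np.sum(H @ yp)))
    return float(np.mean(vals))

def permutation_penalty_closed_form(H, y):
    """Closed form of part (c), assuming S = H and s_y^2 = mean(y^2) - ybar^2."""
    N = len(y)
    s2 = np.mean(y ** 2) - y.mean() ** 2
    sigma2_hat = N / (N - 1) * s2          # unbiased variance estimate
    ones = np.ones(N)
    return float(2.0 * sigma2_hat / N * (np.trace(H) - ones @ H @ ones / N))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=5)])   # small N so 5! = 120 permutations
y = rng.normal(size=5)
lam = 0.1
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(2), X.T)

exact = expected_permutation_penalty(H, y)
closed = permutation_penalty_closed_form(H, y)
```

The two numbers `exact` and `closed` should agree up to floating-point error; enumeration is only feasible for very small $N$, which is why the closed form matters in practice.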

Problem 9.19 Repeat Problem 9.18 for the Bootstrap optimism penalty
and show that

$$\text{Bootstrap optimism penalty} = \frac{2 s_y^2}{N}\,\text{trace}(H).$$

Compare this to the permutation optimism penalty in Problem 9.18.


[Hint: How does Problem 9.18(b) change for the Bootstrap setting?]


Problem 9.20 This is a repeat of Problem 5.2. Structural Risk Minimization
(SRM) is a useful framework for model selection that is related to
Occam's Razor. Define a structure, that is, a nested sequence of hypothesis sets:

$$\mathcal{H}_0 \subset \mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$$

The SRM framework picks a hypothesis from each $\mathcal{H}_i$ by minimizing $E_{\text{in}}$.
That is, $g_i = \arg\min_{h\in\mathcal{H}_i} E_{\text{in}}(h)$. Then, the framework selects the final
hypothesis by minimizing $E_{\text{in}}$ and the model complexity penalty $\Omega$. That is,

$$g^* = \arg\min_{i=1,2,\ldots}\big(E_{\text{in}}(g_i) + \Omega(\mathcal{H}_i)\big).$$

Note that $\Omega(\mathcal{H}_i)$ should be non-decreasing in $i$ because of the nested structure.


(a) Show that the in-sample error $E_{\text{in}}(g_i)$ is non-increasing in $i$.


(b) Assume that the framework finds $g^* \in \mathcal{H}_i$ with probability $p_i$. How does
$p_i$ relate to the complexity of the target function?


(c) Argue that the $p_i$'s are unknown but $p_0 \le p_1 \le p_2 \le \cdots \le 1$.

(d) Suppose $g^* = g_i$. Show that

$$\mathbb{P}\big[\,|E_{\text{in}}(g_i) - E_{\text{out}}(g_i)| > \epsilon \;\big|\; g^* = g_i\,\big] \le \frac{1}{p_i}\cdot 4\,m_{\mathcal{H}_i}(2N)\,e^{-\epsilon^2 N/8}.$$

Here, the conditioning is on selecting $g_i$ as the final hypothesis by SRM.
[Hint: Use the Bayes theorem to decompose the probability and then
apply the VC bound on one of the terms.]


You may interpret this result as follows: if you use SRM and end up with $g_i$,
then the generalization bound is a factor $\frac{1}{p_i}$ worse than the bound you would
have gotten had you simply started with $\mathcal{H}_i$.

Problem 9.21 Cross-Validation Leverage for Linear Regression.
In this problem, compute an expression for the leverage defined in
Equation (9.4) for linear regression with weight decay regularization. We will
use the same notation from Problem 4.26, so you may want to review that
problem and some of the tools developed there.


To simulate leaving out the data point $(z_m, y_m)$, set the $m$th row of $Z$ and
the $m$th entry of $y$ to zero, to get data matrix $Z^{(m)}$ and target vector $y^{(m)}$,
so $\mathcal{D}_m = (Z^{(m)}, y^{(m)})$. We need to compute the cross validation error for
this data set, $E_{\text{cv}}(\mathcal{D}_m)$. Let $H^{(m)}$ be the hat matrix you get from doing linear
regression with the data $Z^{(m)}, y^{(m)}$.


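(a) Show that

$$E_{\text{cv}}(\mathcal{D}_m) = \frac{1}{N-1}\sum_{n=1}^{N}\left(\frac{\hat{y}^{(m)}_n - y^{(m)}_n}{1 - H^{(m)}_{nn}}\right)^2.$$

(b) Use the techniques from Problem 4.26 to show that, for $n, k \ne m$,

$$H^{(m)}_{nk} = H_{nk} + \frac{H_{mn}H_{mk}}{1 - H_{mm}}.$$

(c) Similarly, show that, for $n \ne m$,

$$\hat{y}^{(m)}_n = \hat{y}_n + \left(\frac{\hat{y}_m - y_m}{1 - H_{mm}}\right)H_{mn}.$$

[Hint: Use part (c) of Problem 4.26.]

(d) Show that $E_{\text{cv}}(\mathcal{D}_m)$ is given by

$$\frac{1}{N-1}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n + \left(\frac{\hat{y}_m - y_m}{1 - H_{mm}}\right)H_{mn}}{1 - H_{nn} - \frac{H_{mn}^2}{1 - H_{mm}}}\right)^2 - \frac{1}{N-1}\left(\frac{\hat{y}_m - y_m}{1 - 2H_{mm}}\right)^2.$$

(e) Give an expression for the leverage $\ell_m$. What is the running time to
compute the leverages $\ell_1, \ldots, \ell_N$?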


Problem 9.22 The data set for Example 9.3 is

$$X = \begin{bmatrix} 1 & 0.51291\\ 1 & 0.46048\\ 1 & 0.38504\\ 1 & 0.095046\\ 1 & 0.43367\\ 1 & 0.70924\\ 1 & 0.11597 \end{bmatrix}, \qquad y = \begin{bmatrix} 0.36542\\ 0.22156\\ 0.15263\\ 0.10355\\ 0.10015\\ 0.26713\\ 2.3095 \end{bmatrix}.$$

Implement the algorithm from Problem 9.21 to compute the leverages for all
the data points as you vary the regularization parameter $\lambda$.


Give a plot of the leverage of the last data point as a function of $\lambda$. Explain
the dependence you find in your plot.
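
A brute-force baseline can be useful for checking an implementation of the fast algorithm from Problem 9.21. The sketch below is not that algorithm: it recomputes the cross validation error from scratch after deleting each point. It assumes the leverage in Equation (9.4) is $\ell_m = E_{\text{cv}}(\mathcal{D}) - E_{\text{cv}}(\mathcal{D}_m)$ (that equation is not reproduced here, so check it against the text), and it uses the analytic leave-one-out formula $E_{\text{cv}} = \frac{1}{N}\sum_n \big(\frac{\hat{y}_n - y_n}{1 - H_{nn}}\big)^2$ with $H(\lambda) = X(X^TX + \lambda I)^{-1}X^T$. The function names `e_cv` and `leverages` are hypothetical.

```python
import numpy as np

# Data set from Problem 9.22
X = np.array([[1, 0.51291], [1, 0.46048], [1, 0.38504], [1, 0.095046],
              [1, 0.43367], [1, 0.70924], [1, 0.11597]])
y = np.array([0.36542, 0.22156, 0.15263, 0.10355, 0.10015, 0.26713, 2.3095])

def e_cv(X, y, lam):
    """Leave-one-out cross validation error for weight-decay linear regression."""
    N, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)   # hat matrix H(lambda)
    loo_resid = (H @ y - y) / (1.0 - np.diag(H))
    return float(np.mean(loo_resid ** 2))

def leverages(X, y, lam):
    """Assumed definition: leverage_m = E_cv(D) - E_cv(D with point m deleted)."""
    full = e_cv(X, y, lam)
    return np.array([full - e_cv(np.delete(X, m, axis=0), np.delete(y, m), lam)
                     for m in range(len(y))])

# Leverage of the last data point over a grid of regularization parameters
lams = np.logspace(-3, 1, 9)
last_point_leverage = [leverages(X, y, lam)[-1] for lam in lams]
```

Each call to `leverages` costs $N$ hat-matrix computations, which is exactly the overhead that the update formulas in Problem 9.21 are meant to avoid.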
















Problem 9.23 For a sample of 500 digits data (classifying “1” versus 
“not 1”), use linear regression to compute the leverage of the data points (even 
though the problem is a classification problem). You can use the algorithm in 
Problem 9.21. 


Give a similar plot to Figure 9.5 where you highlight in a black box all the data 
points with the top-10 largest (positive) leverages. 


Give some intuition for which points are highlighted. 


Problem 9.24 The Jackknife Estimate. The jackknife is a general
statistical technique used to reduce the bias of an estimator, based on the
assumption that the estimator is asymptotically (as $N$ increases) unbiased. Let
$Z = z_1, \ldots, z_N$ be a sample set. In the case of learning, the sample set is the
data, so $z_n = (x_n, y_n)$. We wish to estimate a quantity $t$ using an estimator
$\hat{t}(Z)$ (some function of the sample $Z$). Assume that $\hat{t}(Z)$ is asymptotically
unbiased, satisfying

$$\mathbb{E}_Z[\hat{t}(Z)] = t + \frac{a_1}{N} + \frac{a_2}{N^2} + \cdots.$$

The bias is $O\left(\frac{1}{N}\right)$. Let $Z_n = z_1, \ldots, z_{n-1}, z_{n+1}, \ldots, z_N$ be the leave one
out sample sets (similar to cross validation), and consider the estimates using
the leave one out samples, $\hat{t}_n = \hat{t}(Z_n)$.


(a) Argue that $\mathbb{E}_Z[\hat{t}_n] = t + \frac{a_1}{N-1} + \frac{a_2}{(N-1)^2} + \cdots$.


(b) Define $\hat{\tau}_n = N\hat{t}(Z) - (N-1)\hat{t}_n$. Show that $\mathbb{E}_Z[\hat{\tau}_n] = t - \frac{a_2}{N(N-1)} + O\left(\frac{1}{N^2}\right)$.
($\hat{\tau}_n$ has an asymptotically smaller bias than $\hat{t}(Z)$.)


The $\hat{\tau}_n$ are called pseudo-values because they have the “correct” expectation.
A natural improvement is to take the average of the pseudo-values,
and this is the jackknife estimate:

$$\hat{t}_J(Z) = \frac{1}{N}\sum_{n=1}^{N}\hat{\tau}_n = N\hat{t}(Z) - \frac{N-1}{N}\sum_{n=1}^{N}\hat{t}_n.$$

(c) Applying the Jackknife to variance estimation. Suppose that the sample
is a bunch of independent random values $x_1, \ldots, x_N$ from a distribution
whose variance $\sigma^2$ we wish to estimate. We suspect that the sample variance,

$$s^2 = \frac{1}{N}\sum_{n=1}^{N} x_n^2 - \frac{1}{N^2}\left(\sum_{n=1}^{N} x_n\right)^2,$$

should be a good estimator, i.e., it has the assumed form for the bias (it does).
Let $s_n^2$ be the sample variances on the leave one out samples. Show that

$$s_n^2 = \frac{1}{N-1}\left(\sum_{m=1}^{N} x_m^2 - x_n^2\right) - \frac{1}{(N-1)^2}\left(\sum_{m=1}^{N} x_m - x_n\right)^2.$$

Hence show that the jackknife estimate $Ns^2 - \frac{N-1}{N}\sum_{n} s_n^2$ is $s_J^2 = \frac{N}{N-1}s^2$,
which is the well known unbiased estimator of the variance. In this particular
case, the jackknife has completely removed the bias (automatically).

(d) What happens to the jackknife if $\hat{t}(Z)$ has an asymptotic bias?


(e) If the leading order term in the bias was $\frac{1}{N^2}$, does the jackknife estimate
have a better or worse bias (in magnitude)?
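
The claim in part (c) is easy to check numerically. The sketch below implements the jackknife estimate $N\hat{t}(Z) - \frac{N-1}{N}\sum_n \hat{t}(Z_n)$ for an arbitrary estimator and applies it to the biased sample variance; the function names `jackknife` and `biased_var` and the random sample are hypothetical, introduced only for illustration.

```python
import numpy as np

def biased_var(x):
    """Sample variance s^2 = mean(x^2) - mean(x)^2 (divides by N, so it is biased)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x ** 2) - np.mean(x) ** 2)

def jackknife(estimator, z):
    """Jackknife estimate: N * t(Z) - (N - 1)/N * sum_n t(Z_n)."""
    z = np.asarray(z)
    N = len(z)
    # estimator evaluated on each leave-one-out sample Z_n
    loo = np.array([estimator(np.delete(z, n, axis=0)) for n in range(N)])
    return float(N * estimator(z) - (N - 1) / N * loo.sum())

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=30)

s2 = biased_var(x)
s2_jack = jackknife(biased_var, x)
s2_unbiased = np.var(x, ddof=1)      # N/(N-1) * s^2
```

Here `s2_jack` and `s2_unbiased` should agree up to rounding, which is the identity $s_J^2 = \frac{N}{N-1}s^2$ from part (c).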

Problem 9.25 The Jackknife for Validation. (See also the previous
problem.) If $E_{\text{in}}$ is an asymptotically unbiased estimate of $E_{\text{out}}$, we may use the
jackknife to reduce this bias and get a better estimate. The sample is the data.
We want to estimate the expected out-of-sample error when learning from $N$
examples, $E_{\text{out}}(N)$. We estimate $E_{\text{out}}(N)$ by the in-sample error $E_{\text{in}}(g^{(\mathcal{D})})$.


(a) What kind of bias (+ve or -ve) will $E_{\text{in}}$ have? When do you expect it to
be asymptotically unbiased?


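(b) Assume that the bias has the required form: $\mathbb{E}_{\mathcal{D}}[E_{\text{in}}(g^{(\mathcal{D})})] = E_{\text{out}}(N) + \frac{a_1}{N} + \frac{a_2}{N^2} + \cdots$.
Now consider one of the leave one out data sets $\mathcal{D}_n$,
which would produce the estimates $E_{\text{in}}(g^{(\mathcal{D}_n)})$. Show that:

i. the pseudo-values are $N E_{\text{in}}(g^{(\mathcal{D})}) - (N-1)E_{\text{in}}(g^{(\mathcal{D}_n)})$;

ii. the jackknife estimate is $E_J = N E_{\text{in}}(g^{(\mathcal{D})}) - \frac{N-1}{N}\sum_{n=1}^{N} E_{\text{in}}(g^{(\mathcal{D}_n)})$.

(c) Argue that $\mathbb{E}_{\mathcal{D}}[E_{\text{in}}(g^{(\mathcal{D}_n)})] = E_{\text{out}}(N-1) + \frac{a_1}{N-1} + \frac{a_2}{(N-1)^2} + \cdots$, and
hence show that the expectation of the jackknife estimate is given by

$$\mathbb{E}_{\mathcal{D}}[E_J] = E_{\text{out}}(N) + (N-1)\big(E_{\text{out}}(N) - E_{\text{out}}(N-1)\big) + O\left(\frac{1}{N^2}\right).$$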
(d) If the learning curve converges, having a form $E_{\text{out}}(N) = E + \frac{b_1}{N} + \frac{b_2}{N^2} + \cdots$,
then show that $\mathbb{E}_{\mathcal{D}}[E_J] = E_{\text{out}}(N) - \frac{b_1}{N} + O\left(\frac{1}{N^2}\right)$.
The jackknife replaced the term $\frac{a_1}{N}$ in the bias by $-\frac{b_1}{N}$. (In a similar vein to
cross validation, the jackknife replaces the bias of the in-sample estimate
with the bias in the learning curve.) When will the jackknife be helpful?


Problem 9.26 The Jackknife Estimate for Linear Models. The
jackknife validation estimate can be computed analytically for linear models.
Define the matrix $H^{(\delta)} = H^2 - H$, and let $y^{(\delta)} = H^{(\delta)}y$.


(a) Show that the jackknife estimate is given by

$$E_J = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{(\hat{y}_n - y_n)^2}{1 - H_{nn}} - \frac{2y^{(\delta)}_n(\hat{y}_n - y_n)}{1 - H_{nn}} - \frac{H^{(\delta)}_{nn}(\hat{y}_n - y_n)^2}{(1 - H_{nn})^2}\right). \qquad (9.9)$$

[Hint: you may find some of the formulas used in deriving the cross
validation estimate for linear models useful from Problem 4.26.]

(b) When $\lambda = 0$, what is $E_J$? Compare to $E_{\text{cv}}$ in (4.13). Show that $E_{\text{in}} \le
E_J \le E_{\text{cv}}$. [Hint: $H$ is a projection matrix, so $H^2 = H$ when $\lambda = 0$; also,
$H_{nn} = x_n^T(X^TX)^{-1}x_n$, so show that $0 \le H_{nn} \le 1$ (show that for any
positive definite (invertible) matrix $A$, $x_n^T(x_n x_n^T + A)^{-1}x_n \le 1$).]
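
Formula (9.9) can be checked against the definition of the jackknife estimate from Problem 9.25, $E_J = N E_{\text{in}}(g^{(\mathcal{D})}) - \frac{N-1}{N}\sum_n E_{\text{in}}(g^{(\mathcal{D}_n)})$, by refitting on every leave-one-out data set. The sketch below does this for weight-decay linear regression with squared error; the function names and the random test data are hypothetical and serve only as an illustration.

```python
import numpy as np

def jackknife_linear(X, y, lam=0.0):
    """Analytic jackknife estimate (9.9) for weight-decay linear regression."""
    N, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    e = H @ y - y                 # in-sample residuals yhat_n - y_n
    Hd = H @ H - H                # H^(delta)
    yd = Hd @ y                   # y^(delta)
    h = np.diag(H)
    terms = e**2 / (1 - h) - 2 * yd * e / (1 - h) - np.diag(Hd) * e**2 / (1 - h)**2
    return float(terms.mean())

def jackknife_brute(X, y, lam=0.0):
    """Jackknife estimate from its definition, refitting on each leave-one-out set."""
    N, d = X.shape

    def e_in(Xs, ys):
        w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)
        return np.mean((Xs @ w - ys) ** 2)

    loo = sum(e_in(np.delete(X, n, axis=0), np.delete(y, n)) for n in range(N))
    return float(N * e_in(X, y) - (N - 1) / N * loo)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(12), rng.normal(size=(12, 2))])
y = rng.normal(size=12)

ej_formula = jackknife_linear(X, y, lam=0.5)
ej_brute = jackknife_brute(X, y, lam=0.5)
```

The two values should match up to rounding. With `lam=0.0` the last two terms of (9.9) vanish numerically, which is the case examined in part (b).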

Problem 9.27 Feature Selection. Let $x \in \mathbb{R}^d$ and let $\mathcal{H}$ be the
perceptron model $\text{sign}(w^T x)$. The zero norm, $\|w\|_0$, is the number of non-zero
elements in $w$. Let $\mathcal{H}_k = \{w : \|w\|_0 \le k\}$. The hypothesis set $\mathcal{H}_k$
contains the classifiers which use only $k$ of the dimensions (or features) to
classify the input. Model selection among the $\mathcal{H}_k$ corresponds to selecting the
optimal number of features. Picking a particular hypothesis in $\mathcal{H}_k$ corresponds
to picking the best set of $k$ features.


(a) Show that order selection is a special case of feature selection.


(b) Show that $d_{\text{vc}}(\mathcal{H}_k) \le \max(d + 1, (2k + 1)\log_2 d) = O(k \log d)$, which
for $k = o(d)$ is an improvement over the trivial bound of $d + 1$.
[Hint: Show that $d_{\text{vc}} \le M$ for any $M$ which satisfies $2^M > M^{k+1}\binom{d}{k}$;
to show this, remember that $\mathcal{H}_k$ is the union of smaller models (how
many? what is their VC-dimension?), and use the relationship between
the VC-dimension of a model and the maximum number of dichotomies
it can implement on any $M$ points. Now use the fact that $\binom{d}{k} \le \left(\frac{ed}{k}\right)^k$.]


Problem 9.28 Forward and Backward Subset Selection (FSS
and BSS). For selecting $k$ features, we fix $\mathcal{H}_k$, and ask for the hypothesis
which minimizes $E_{\text{in}}$ (typically $k < d$). There is no known efficient algorithm
to solve this problem (it is an NP-hard problem to even get close to the minimum
$E_{\text{in}}$). FSS and BSS are greedy heuristics to solve the problem.


FSS: Start with $k = 1$ and find the best single feature dimension $i_1$ yielding
minimum $E_{\text{in}}$. Now fix this feature dimension and find the next feature
dimension $i_2$ which when coupled with the feature dimension $i_1$ yields
minimum $E_{\text{in}}$. Continue in this way, adding one feature at a time, until
you have $k$ features.


BSS: Start with all $d$ features and now remove one feature at a time until you
are down to $k$ features. Each time you remove the feature which results
in minimum increase in $E_{\text{in}}$.


(a) Which of FSS and BSS do you expect to be more efficient? Suppose
that to optimize with $i$ features and $N$ data points takes $O(iN)$ time.
What is the asymptotic run time of FSS and BSS? For what values of $k$
would you use FSS over BSS and vice versa? [Hint: show that FSS takes
$N\sum_{i=1}^{k} i(d + 1 - i)$ time and BSS takes $N\sum_{i=1}^{d-k}(d - i)(d + 1 - i)$.]

(b) Show that $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_d$ are a structure. (See Problem 9.20.)


(c) Implement FSS. Use the bound on the VC-dimension given in
Problem 9.27 to perform SRM using the digits data to classify 1 (class $+1$)
from $\{2, 3, 4, 5, 6, 7, 8, 9, 0\}$ (all in class $-1$).


(d) What are the first few most useful features? What is the optimal number 
of features (according to SRM)? 
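
The greedy loop of FSS can be sketched in a few lines. Note the loud assumption: the problem asks for the perceptron that minimizes $E_{\text{in}}$ on each candidate feature set, which is hard to compute exactly, so this sketch swaps in a least-squares linear fit followed by a sign as a stand-in for that inner optimization. The function names `fit_error` and `forward_subset_selection` and the synthetic data are hypothetical, and the SRM step of part (c) is not included.

```python
import numpy as np

def fit_error(X, y, features):
    """Classification E_in of a least-squares linear classifier on the chosen features.

    Stand-in for "minimize E_in over perceptrons using these features"
    (an assumption of this sketch, not the method prescribed by the problem).
    """
    Z = np.column_stack([np.ones(len(y)), X[:, features]])   # add bias coordinate
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(np.mean(np.sign(Z @ w) != y))

def forward_subset_selection(X, y, k):
    """Greedy FSS: repeatedly add the feature that gives the smallest E_in."""
    chosen = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: fit_error(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic example: only feature 2 determines the label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 2])
selected = forward_subset_selection(X, y, k=2)
```

At step $i$ the loop tries each of the $d + 1 - i$ remaining features with a fit on $i$ features, which is the count behind the FSS run time in the hint of part (a).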